Artificial Intelligence
Authors and titles for April 2024
Total of 2408 entries :
2-2001
2001-2408
- [2] arXiv:2404.00051 [ pdf , ps , html , other ]
-
Title: Deja vu: Contrastive Historical Modeling with Prefix-tuning for Temporal Knowledge Graph ReasoningComments: Accepted to NAACL 2024 FindingsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Temporal Knowledge Graph Reasoning (TKGR) is the task of inferring missing facts for incomplete TKGs in complex scenarios (e.g., transductive and inductive settings), which has been gaining increasing attention. Recently, to mitigate dependence on structured connections in TKGs, text-based methods have been developed to utilize rich linguistic information from entity descriptions. However, suffering from the enormous parameters and inflexibility of pre-trained language models, existing text-based methods struggle to balance the textual knowledge and temporal information with computationally expensive purpose-built training strategies. To tap the potential of text-based models for TKGR in various complex scenarios, we propose ChapTER, a Contrastive historical modeling framework with prefix-tuning for TEmporal Reasoning. ChapTER feeds history-contextualized text into the pseudo-Siamese encoders to strike a textual-temporal balance via contrastive estimation between queries and candidates. By introducing virtual time prefix tokens, it applies a prefix-based tuning method to facilitate the frozen PLM capable for TKGR tasks under different settings. We evaluate ChapTER on four transductive and three few-shot inductive TKGR benchmarks, and experimental results demonstrate that ChapTER achieves superior performance compared to competitive baselines with only 0.17% tuned parameters. We conduct thorough analysis to verify the effectiveness, flexibility and efficiency of ChapTER.
- [3] arXiv:2404.00099 [ pdf , ps , html , other ]
-
Title: Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision ProcessesComments: 40 pages, 1 figureSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (stat.ML)
Abstract: We study evaluating a policy under best- and worst-case perturbations to a Markov decision process (MDP), given transition observations from the original MDP, whether under the same or different policy. This is an important problem when there is the possibility of a shift between historical and future environments, due to e.g. unmeasured confounding, distributional shift, or an adversarial environment. We propose a perturbation model that can modify transition kernel densities up to a given multiplicative factor or its reciprocal, which extends the classic marginal sensitivity model (MSM) for single time step decision making to infinite-horizon RL. We characterize the sharp bounds on policy value under this model, that is, the tightest possible bounds given by the transition observations from the original MDP, and we study the estimation of these bounds from such transition observations. We develop an estimator with several appealing guarantees: it is semiparametrically efficient, and remains so even when certain necessary nuisance functions such as worst-case Q-functions are estimated at slow nonparametric rates. It is also asymptotically normal, enabling easy statistical inference using Wald confidence intervals. In addition, when certain nuisances are estimated inconsistently we still estimate a valid, albeit possibly not sharp bounds on the policy value. We validate these properties in numeric simulations. The combination of accounting for environment shifts from train to test (robustness), being insensitive to nuisance-function estimation (orthogonality), and accounting for having only finite samples to learn from (inference) together leads to credible and reliable policy evaluation.
- [4] arXiv:2404.00140 [ pdf , ps , html , other ]
-
Title: Does Faithfulness Conflict with Plausibility? An Empirical Study in Explainable AI across NLP TasksSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Explainability algorithms aimed at interpreting decision-making AI systems usually consider balancing two critical dimensions: 1) \textit{faithfulness}, where explanations accurately reflect the model's inference process. 2) \textit{plausibility}, where explanations are consistent with domain experts. However, the question arises: do faithfulness and plausibility inherently conflict? In this study, through a comprehensive quantitative comparison between the explanations from the selected explainability methods and expert-level interpretations across three NLP tasks: sentiment analysis, intent detection, and topic labeling, we demonstrate that traditional perturbation-based methods Shapley value and LIME could attain greater faithfulness and plausibility. Our findings suggest that rather than optimizing for one dimension at the expense of the other, we could seek to optimize explainability algorithms with dual objectives to achieve high levels of accuracy and user accessibility in their explanations.
- [5] arXiv:2404.00276 [ pdf , ps , html , other ]
-
Title: Instruction-Driven Game Engines on Large Language ModelsSubjects: Artificial Intelligence (cs.AI)
Abstract: The Instruction-Driven Game Engine (IDGE) project aims to democratize game development by enabling a large language model (LLM) to follow free-form game rules and autonomously generate game-play processes. The IDGE allows users to create games by issuing simple natural language instructions, which significantly lowers the barrier for game development. We approach the learning process for IDGEs as a Next State Prediction task, wherein the model autoregressively predicts in-game states given player actions. It is a challenging task because the computation of in-game states must be precise; otherwise, slight errors could disrupt the game-play. To address this, we train the IDGE in a curriculum manner that progressively increases the model's exposure to complex scenarios. Our initial progress lies in developing an IDGE for Poker, a universally cherished card game. The engine we've designed not only supports a wide range of poker variants but also allows for high customization of rules through natural language inputs. Furthermore, it also favors rapid prototyping of new games from minimal samples, proposing an innovative paradigm in game development that relies on minimal prompt and data engineering. This work lays the groundwork for future advancements in instruction-driven game creation, potentially transforming how games are designed and played.
- [6] arXiv:2404.00320 [ pdf , ps , html , other ]
-
Title: Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered PerspectivesComments: Under reviewed by ACII 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: This research tackles the challenge of integrating heterogeneous data for specific behavior recognition within the domain of Pain Recognition, presenting a novel methodology that harmonizes statistical correlations with a human-centered approach. By leveraging a diverse range of deep learning architectures, we highlight the adaptability and efficacy of our approach in improving model performance across various complex scenarios. The novelty of our methodology is the strategic incorporation of statistical relevance weights and the segmentation of modalities from a human-centric perspective, enhancing model precision and providing a explainable analysis of multimodal data. This study surpasses traditional modality fusion techniques by underscoring the role of data diversity and customized modality segmentation in enhancing pain behavior analysis. Introducing a framework that matches each modality with an suited classifier, based on the statistical significance, signals a move towards customized and accurate multimodal fusion strategies. Our contributions extend beyond the field of Pain Recognition by delivering new insights into modality fusion and human-centered computing applications, contributing towards explainable AI and bolstering patient-centric healthcare interventions. Thus, we bridge a significant void in the effective and interpretable fusion of multimodal data, establishing a novel standard for forthcoming inquiries in pain behavior recognition and allied fields.
- [7] arXiv:2404.00341 [ pdf , ps , other ]
-
Title: Ontology in Holonic Cooperative Manufacturing: A Solution to Share and Exchange the KnowledgeSubjects: Artificial Intelligence (cs.AI)
Abstract: Cooperative manufacturing is a new trend in industry, which depends on the existence of a collaborative robot. A collaborative robot is usually a light-weight robot which is capable of operating safely with a human co-worker in a shared work environment. During this cooperation, a vast amount of information is exchanged between the collaborative robot and the worker. This information constructs the cooperative manufacturing knowledge, which describes the production components and environment. In this research, we propose a holonic control solution, which uses the ontology concept to represent the cooperative manufacturing knowledge. The holonic control solution is implemented as an autonomous multi-agent system that exchanges the manufacturing knowledge based on an ontology model. Ultimately, the research illustrates and implements the proposed solution over a cooperative assembly scenario, which involves two workers and one collaborative robot, whom cooperate together to assemble a customized product.
- [8] arXiv:2404.00560 [ pdf , ps , html , other ]
-
Title: A Theory for Length Generalization in Learning to ReasonComments: arXiv admin note: text overlap with arXiv:2311.16173Subjects: Artificial Intelligence (cs.AI)
Abstract: Length generalization (LG) is a challenging problem in learning to reason. It refers to the phenomenon that when trained on reasoning problems of smaller lengths or sizes, the resulting model struggles with problems of larger sizes or lengths. Although LG has been studied by many researchers, the challenge remains. This paper proposes a theoretical study of LG for problems whose reasoning processes can be modeled as DAGs (directed acyclic graphs). The paper first identifies and proves the conditions under which LG can be achieved in learning to reason. It then designs problem representations based on the theory to learn to solve challenging reasoning problems like parity, addition, and multiplication, using a Transformer to achieve perfect LG.
- [9] arXiv:2404.00586 [ pdf , ps , html , other ]
-
Title: RLGNet: Repeating-Local-Global History Network for Temporal Knowledge Graph ReasoningSubjects: Artificial Intelligence (cs.AI)
Abstract: Temporal Knowledge Graph (TKG) reasoning is based on historical information to predict the future. Therefore, parsing and mining historical information is key to predicting the future. Most existing methods fail to concurrently address and comprehend historical information from both global and local perspectives. Neglecting the global view might result in overlooking macroscopic trends and patterns, while ignoring the local view can lead to missing critical detailed information. Additionally, some methods do not focus on learning from high-frequency repeating events, which means they may not fully grasp frequently occurring historical events. To this end, we propose the \textbf{R}epetitive-\textbf{L}ocal-\textbf{G}lobal History \textbf{Net}work(RLGNet). We utilize a global history encoder to capture the overarching nature of historical information. Subsequently, the local history encoder provides information related to the query timestamp. Finally, we employ the repeating history encoder to identify and learn from frequently occurring historical events. In the evaluation on six benchmark datasets, our approach generally outperforms existing TKG reasoning models in multi-step and single-step reasoning tasks.
- [10] arXiv:2404.00756 [ pdf , ps , html , other ]
-
Title: Recover: A Neuro-Symbolic Framework for Failure Detection and RecoverySubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Robotics (cs.RO)
Abstract: Recognizing failures during task execution and implementing recovery procedures is challenging in robotics. Traditional approaches rely on the availability of extensive data or a tight set of constraints, while more recent approaches leverage large language models (LLMs) to verify task steps and replan accordingly. However, these methods often operate offline, necessitating scene resets and incurring in high costs. This paper introduces Recover, a neuro-symbolic framework for online failure identification and recovery. By integrating ontologies, logical rules, and LLM-based planners, Recover exploits symbolic information to enhance the ability of LLMs to generate recovery plans and also to decrease the associated costs. In order to demonstrate the capabilities of our method in a simulated kitchen environment, we introduce OntoThor, an ontology describing the AI2Thor simulator setting. Empirical evaluation shows that OntoThor's logical rules accurately detect all failures in the analyzed tasks, and that Recover considerably outperforms, for both failure detection and recovery, a baseline method reliant solely on LLMs.
- [11] arXiv:2404.00886 [ pdf , ps , html , other ]
-
Title: MTLight: Efficient Multi-Task Reinforcement Learning for Traffic Signal ControlSubjects: Artificial Intelligence (cs.AI)
Abstract: Traffic signal control has a great impact on alleviating traffic congestion in modern cities. Deep reinforcement learning (RL) has been widely used for this task in recent years, demonstrating promising performance but also facing many challenges such as limited performances and sample inefficiency. To handle these challenges, MTLight is proposed to enhance the agent observation with a latent state, which is learned from numerous traffic indicators. Meanwhile, multiple auxiliary and supervisory tasks are constructed to learn the latent state, and two types of embedding latent features, the task-specific feature and task-shared feature, are used to make the latent state more abundant. Extensive experiments conducted on CityFlow demonstrate that MTLight has leading convergence speed and asymptotic performance. We further simulate under peak-hour pattern in all scenarios with increasing control difficulty and the results indicate that MTLight is highly adaptable.
- [12] arXiv:2404.01266 [ pdf , ps , html , other ]
-
Title: IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic RepresentationsDeqing Fu , Ghazal Khalighinejad , Ollie Liu , Bhuwan Dhingra , Dani Yogatama , Robin Jia , Willie NeiswangerSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Current foundation models exhibit impressive capabilities when prompted either with text only or with both image and text inputs. But do their capabilities change depending on the input modality? In this work, we propose $\textbf{IsoBench}$, a benchmark dataset containing problems from four major areas: math, science, algorithms, and games. Each example is presented with multiple $\textbf{isomorphic representations}$ of inputs, such as visual, textual, and mathematical presentations. IsoBench provides fine-grained feedback to diagnose performance gaps caused by the form of the representation. Across various foundation models, we observe that on the same problem, models have a consistent preference towards textual representations. Most prominently, when evaluated on all IsoBench problems, Claude-3 Opus performs 28.7 points worse when provided with images instead of text; similarly, GPT-4 Turbo is 18.7 points worse and Gemini Pro is 14.9 points worse. Finally, we present two prompting techniques, $\textit{IsoCombination}$ and $\textit{IsoScratchPad}$, which improve model performance by considering combinations of, and translations between, different input representations.
- [13] arXiv:2404.01308 [ pdf , ps , html , other ]
-
Title: Learning to Solve Job Shop Scheduling under UncertaintyComments: To be published at CPAIOR 2024Subjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract: Job-Shop Scheduling Problem (JSSP) is a combinatorial optimization problem where tasks need to be scheduled on machines in order to minimize criteria such as makespan or delay. To address more realistic scenarios, we associate a probability distribution with the duration of each task. Our objective is to generate a robust schedule, i.e. that minimizes the average makespan. This paper introduces a new approach that leverages Deep Reinforcement Learning (DRL) techniques to search for robust solutions, emphasizing JSSPs with uncertain durations. Key contributions of this research include: (1) advancements in DRL applications to JSSPs, enhancing generalization and scalability, (2) a novel method for addressing JSSPs with uncertain durations. The Wheatley approach, which integrates Graph Neural Networks (GNNs) and DRL, is made publicly available for further research and applications.
- [14] arXiv:2404.01503 [ pdf , ps , html , other ]
-
Title: Some Orders Are Important: Partially Preserving Orders in Top-Quality PlanningComments: To appear at SoCS 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: The ability to generate multiple plans is central to using planning in real-life applications. Top-quality planners generate sets of such top-cost plans, allowing flexibility in determining equivalent ones. In terms of the order between actions in a plan, the literature only considers two extremes -- either all orders are important, making each plan unique, or all orders are unimportant, treating two plans differing only in the order of actions as equivalent. To allow flexibility in selecting important orders, we propose specifying a subset of actions the orders between which are important, interpolating between the top-quality and unordered top-quality planning problems. We explore the ways of adapting partial order reduction search pruning techniques to address this new computational problem and present experimental evaluations demonstrating the benefits of exploiting such techniques in this setting.
- [15] arXiv:2404.01526 [ pdf , ps , html , other ]
-
Title: Categorical semiotics: Foundations for Knowledge IntegrationComments: 71 pages, 15 figures. arXiv admin note: substantial text overlap with arXiv:1604.02790Subjects: Artificial Intelligence (cs.AI)
Abstract: The integration of knowledge extracted from diverse models, whether described by domain experts or generated by machine learning algorithms, has historically been challenged by the absence of a suitable framework for specifying and integrating structures, learning processes, data transformations, and data models or rules. In this work, we extend algebraic specification methods to address these challenges within such a framework.
In our work, we tackle the challenging task of developing a comprehensive framework for defining and analyzing deep learning architectures. We believe that previous efforts have fallen short by failing to establish a clear connection between the constraints a model must adhere to and its actual implementation.
Our methodology employs graphical structures that resemble Ehresmann's sketches, interpreted within a universe of fuzzy sets. This approach offers a unified theory that elegantly encompasses both deterministic and non-deterministic neural network designs. Furthermore, we highlight how this theory naturally incorporates fundamental concepts from computer science and automata theory. Our extended algebraic specification framework, grounded in graphical structures akin to Ehresmann's sketches, offers a promising solution for integrating knowledge across disparate models and domains. By bridging the gap between domain-specific expertise and machine-generated insights, we pave the way for more comprehensive, collaborative, and effective approaches to knowledge integration and modeling. - [16] arXiv:2404.01677 [ pdf , ps , html , other ]
-
Title: Towards Generalizable and Faithful Logic Reasoning over Natural Language via Resolution RefutationComments: LREC-Coling 2024Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Large language models (LLMs) have achieved significant performance in various natural language reasoning tasks. However, they still struggle with performing first-order logic reasoning over formal logical theories expressed in natural language. This is because the previous LLMs-based reasoning systems have the theoretical incompleteness issue. As a result, it can only address a limited set of simple reasoning problems, which significantly decreases their generalization ability. To address this issue, we propose a novel framework, named Generalizable and Faithful Reasoner (GFaiR), which introduces the paradigm of resolution refutation. Resolution refutation has the capability to solve all first-order logic reasoning problems by extending reasoning rules and employing the principle of proof by contradiction, so our system's completeness can be improved by introducing resolution refutation. Experimental results demonstrate that our system outperforms previous works by achieving state-of-the-art performances in complex scenarios while maintaining performances in simple scenarios. Besides, we observe that GFaiR is faithful to its reasoning process.
- [17] arXiv:2404.01794 [ pdf , ps , html , other ]
-
Title: Imitation Game: A Model-based and Imitation Learning Deep Reinforcement Learning HybridComments: Accepted as publication at MSCPES '24Subjects: Artificial Intelligence (cs.AI)
Abstract: Autonomous and learning systems based on Deep Reinforcement Learning have firmly established themselves as a foundation for approaches to creating resilient and efficient Cyber-Physical Energy Systems. However, most current approaches suffer from two distinct problems: Modern model-free algorithms such as Soft Actor Critic need a high number of samples to learn a meaningful policy, as well as a fallback to ward against concept drifts (e. g., catastrophic forgetting). In this paper, we present the work in progress towards a hybrid agent architecture that combines model-based Deep Reinforcement Learning with imitation learning to overcome both problems.
- [18] arXiv:2404.02039 [ pdf , ps , html , other ]
-
Title: A Survey on Large Language Model-Based Game AgentsSubjects: Artificial Intelligence (cs.AI)
Abstract: The development of game agents holds a critical role in advancing towards Artificial General Intelligence (AGI). The progress of LLMs and their multimodal counterparts (MLLMs) offers an unprecedented opportunity to evolve and empower game agents with human-like decision-making capabilities in complex computer game environments. This paper provides a comprehensive overview of LLM-based game agents from a holistic viewpoint. First, we introduce the conceptual architecture of LLM-based game agents, centered around six essential functional components: perception, memory, thinking, role-playing, action, and learning. Second, we survey existing representative LLM-based game agents documented in the literature with respect to methodologies and adaptation agility across six genres of games, including adventure, communication, competition, cooperation, simulation, and crafting & exploration games. Finally, we present an outlook of future research and development directions in this burgeoning field. A curated list of relevant papers is maintained and made accessible at: this https URL .
- [19] arXiv:2404.02078 [ pdf , ps , html , other ]
-
Title: Advancing LLM Reasoning Generalists with Preference TreesLifan Yuan , Ganqu Cui , Hanbin Wang , Ning Ding , Xingyao Wang , Jia Deng , Boji Shan , Huimin Chen , Ruobing Xie , Yankai Lin , Zhenghao Liu , Bowen Zhou , Hao Peng , Zhiyuan Liu , Maosong SunComments: Models and data are available at this https URLSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly-curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.
- [20] arXiv:2404.02454 [ pdf , ps , html , other ]
-
Title: Techniques for Measuring the Inferential Strength of Forgetting PoliciesSubjects: Artificial Intelligence (cs.AI) ; Logic in Computer Science (cs.LO)
Abstract: The technique of forgetting in knowledge representation has been shown to be a powerful and useful knowledge engineering tool with widespread application. Yet, very little research has been done on how different policies of forgetting, or use of different forgetting operators, affects the inferential strength of the original theory. The goal of this paper is to define loss functions for measuring changes in inferential strength based on intuitions from model counting and probability theory. Properties of such loss measures are studied and a pragmatic knowledge engineering tool is proposed for computing loss measures using Problog. The paper includes a working methodology for studying and determining the strength of different forgetting policies, in addition to concrete examples showing how to apply the theoretical results using Problog. Although the focus is on forgetting, the results are much more general and should have wider application to other areas.
- [21] arXiv:2404.02499 [ pdf , ps , html , other ]
-
Title: Learning Generalized Policies for Fully Observable Non-Deterministic Planning DomainsSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: General policies represent reactive strategies for solving large families of planning problems like the infinite collection of solvable instances from a given domain. Methods for learning such policies from a collection of small training instances have been developed successfully for classical domains. In this work, we extend the formulations and the resulting combinatorial methods for learning general policies over fully observable, non-deterministic (FOND) domains. We also evaluate the resulting approach experimentally over a number of benchmark domains in FOND planning, present the general policies that result in some of these domains, and prove their correctness. The method for learning general policies for FOND planning can actually be seen as an alternative FOND planning method that searches for solutions, not in the given state space but in an abstract space defined by features that must be learned as well.
- [22] arXiv:2404.02532 [ pdf , ps , html , other ]
-
Title: Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser GameComments: 13 pages, 2 figuresSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: With the enhanced performance of large models on natural language processing tasks, potential moral and ethical issues of large models arise. There exist malicious attackers who induce large models to jailbreak and generate information containing illegal, privacy-invasive information through techniques such as prompt engineering. As a result, large models counter malicious attackers' attacks using techniques such as safety alignment. However, the strong defense mechanism of the large model through rejection replies is easily identified by attackers and used to strengthen attackers' capabilities. In this paper, we propose a multi-agent attacker-disguiser game approach to achieve a weak defense mechanism that allows the large model to both safely reply to the attacker and hide the defense intent. First, we construct a multi-agent framework to simulate attack and defense scenarios, playing different roles to be responsible for attack, disguise, safety evaluation, and disguise evaluation tasks. After that, we design attack and disguise game algorithms to optimize the game strategies of the attacker and the disguiser and use the curriculum learning process to strengthen the capabilities of the agents. The experiments verify that the method in this paper is more effective in strengthening the model's ability to disguise the defense intent compared with other methods. Moreover, our approach can adapt any black-box large model to assist the model in defense and does not suffer from model version iterations.
- [23] arXiv:2404.02579 [ pdf , ps , html , other ]
-
Title: Learning Alternative Ways of Performing a TaskComments: 32 pages, Github repository, published paper, authors' versionJournal-ref: Expert Systems With Applications, volume 148, 2020, 113263Subjects: Artificial Intelligence (cs.AI)
Abstract: A common way of learning to perform a task is to observe how it is carried out by experts. However, it is well known that for most tasks there is no unique way to perform them. This is especially noticeable the more complex the task is because factors such as the skill or the know-how of the expert may well affect the way she solves the task. In addition, learning from experts also suffers of having a small set of training examples generally coming from several experts (since experts are usually a limited and expensive resource), being all of them positive examples (i.e. examples that represent successful executions of the task). Traditional machine learning techniques are not useful in such scenarios, as they require extensive training data. Starting from very few executions of the task presented as activity sequences, we introduce a novel inductive approach for learning multiple models, with each one representing an alternative strategy of performing a task. By an iterative process based on generalisation and specialisation, we learn the underlying patterns that capture the different styles of performing a task exhibited by the examples. We illustrate our approach on two common activity recognition tasks: a surgical skills training task and a cooking domain. We evaluate the inferred models with respect to two metrics that measure how well the models represent the examples and capture the different forms of executing a task showed by the examples. We compare our results with the traditional process mining approach and show that a small set of meaningful examples is enough to obtain patterns that capture the different strategies that are followed to solve the tasks.
- [24] arXiv:2404.02611 [ pdf , ps , html , other ]
-
Title: SHIELD: A regularization technique for eXplainable Artificial IntelligenceComments: 18 pages, 8 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: As Artificial Intelligence systems become integral across domains, the demand for explainability grows. While the effort by the scientific community is focused on obtaining a better explanation for the model, it is important not to ignore the potential of this explanation process to improve training as well. While existing efforts primarily focus on generating and evaluating explanations for black-box models, there remains a critical gap in directly enhancing models through these evaluations. This paper introduces SHIELD (Selective Hidden Input Evaluation for Learning Dynamics), a regularization technique for explainable artificial intelligence designed to improve model quality by concealing portions of input data and assessing the resulting discrepancy in predictions. In contrast to conventional approaches, SHIELD regularization seamlessly integrates into the objective function, enhancing model explainability while also improving performance. Experimental validation on benchmark datasets underscores SHIELD's effectiveness in improving Artificial Intelligence model explainability and overall performance. This establishes SHIELD regularization as a promising pathway for developing transparent and reliable Artificial Intelligence regularization techniques.
- [25] arXiv:2404.02831 [ pdf , ps , html , other ]
-
Title: Empowering Biomedical Discovery with AI AgentsShanghua Gao , Ada Fang , Yepeng Huang , Valentina Giunchiglia , Ayush Noori , Jonathan Richard Schwarz , Yasha Ektefaie , Jovana Kondic , Marinka ZitnikSubjects: Artificial Intelligence (cs.AI)
Abstract: We envision 'AI scientists' as systems capable of skeptical learning and reasoning that empower biomedical research through collaborative agents that integrate machine learning tools with experimental platforms. Rather than taking humans out of the discovery process, biomedical AI agents combine human creativity and expertise with AI's ability to analyze large datasets, navigate hypothesis spaces, and execute repetitive tasks. AI agents are proficient in a variety of tasks, including self-assessment and planning of discovery workflows. These agents use large language models and generative models to feature structured memory for continual learning and use machine learning tools to incorporate scientific knowledge, biological principles, and theories. AI agents can impact areas ranging from hybrid cell simulation, programmable control of phenotypes, and the design of cellular circuits to the development of new therapies.
- [26] arXiv:2404.02838 [ pdf , ps , html , other ]
-
Title: I-Design: Personalized LLM Interior DesignerSubjects: Artificial Intelligence (cs.AI)
Abstract: Interior design allows us to be who we are and live how we want - each design is as unique as our distinct personality. However, it is not trivial for non-professionals to express and materialize this since it requires aligning functional and visual expectations with the constraints of physical space; this renders interior design a luxury. To make it more accessible, we present I-Design, a personalized interior designer that allows users to generate and visualize their design goals through natural language communication. I-Design starts with a team of large language model agents that engage in dialogues and logical reasoning with one another, transforming textual user input into feasible scene graph designs with relative object relationships. Subsequently, an effective placement algorithm determines optimal locations for each object within the scene. The final design is then constructed in 3D by retrieving and integrating assets from an existing object database. Additionally, we propose a new evaluation protocol that utilizes a vision-language model and complements the design pipeline. Extensive quantitative and qualitative experiments show that I-Design outperforms existing methods in delivering high-quality 3D design solutions and aligning with abstract concepts that match user input, showcasing its advantages across detailed 3D arrangement and conceptual fidelity.
- [27] arXiv:2404.02872 [ pdf , ps , html , other ]
-
Title: Integrating Explanations in Learning LTL Specifications from DemonstrationsAshutosh Gupta , John Komp , Abhay Singh Rajput , Krishna Shankaranarayanan , Ashutosh Trivedi , Namrita VarshneyComments: 21 Pages, 13 Page AppendixSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper investigates whether recent advances in Large Language Models (LLMs) can assist in translating human explanations into a format that can robustly support learning Linear Temporal Logic (LTL) from demonstrations. Both LLMs and optimization-based methods can extract LTL specifications from demonstrations; however, they have distinct limitations. LLMs can quickly generate solutions and incorporate human explanations, but their lack of consistency and reliability hampers their applicability in safety-critical domains. On the other hand, optimization-based methods do provide formal guarantees but cannot process natural language explanations and face scalability challenges. We present a principled approach to combining LLMs and optimization-based methods to faithfully translate human explanations and demonstrations into LTL specifications. We have implemented a tool called Janaka based on our approach. Our experiments demonstrate the effectiveness of combining explanations with demonstrations in learning LTL specifications through several case studies.
- [28] arXiv:2404.03054 [ pdf , ps , html , other ]
-
Title: Data-Driven Goal Recognition Design for General Behavioral AgentsSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Goal recognition design aims to make limited modifications to decision-making environments with the goal of making it easier to infer the goals of agents acting within those environments. Although various research efforts have been made in goal recognition design, existing approaches are computationally demanding and often assume that agents are (near-)optimal in their decision-making. To address these limitations, we introduce a data-driven approach to goal recognition design that can account for agents with general behavioral models. Following existing literature, we use worst-case distinctiveness ($\textit{wcd}$) as a measure of the difficulty in inferring the goal of an agent in a decision-making environment. Our approach begins by training a machine learning model to predict the $\textit{wcd}$ for a given environment and the agent behavior model. We then propose a gradient-based optimization framework that accommodates various constraints to optimize decision-making environments for enhanced goal recognition. Through extensive simulations, we demonstrate that our approach outperforms existing methods in reducing $\textit{wcd}$ and enhancing runtime efficiency in conventional setups, and it also adapts to scenarios not previously covered in the literature, such as those involving flexible budget constraints, more complex environments, and suboptimal agent behavior. Moreover, we have conducted human-subject experiments which confirm that our method can create environments that facilitate efficient goal recognition from real-world human decision-makers.
- [29] arXiv:2404.03441 [ pdf , ps , html , other ]
-
Title: Benchmarking ChatGPT on Algorithmic ReasoningSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: We evaluate ChatGPT's ability to solve algorithm problems from the CLRS benchmark suite that is designed for GNNs. The benchmark requires the use of a specified classical algorithm to solve a given problem. We find that ChatGPT outperforms specialist GNN models, using Python to successfully solve these problems. This raises new points in the discussion about learning algorithms with neural networks and how we think about what out of distribution testing looks like with web scale training data.
- [30] arXiv:2404.03499 [ pdf , ps , html , other ]
-
Title: Comprehensible Artificial Intelligence on Knowledge Graphs: A surveySubjects: Artificial Intelligence (cs.AI)
Abstract: Artificial Intelligence applications gradually move outside the safe walls of research labs and invade our daily lives. This is also true for Machine Learning methods on Knowledge Graphs, which has led to a steady increase in their application since the beginning of the 21st century. However, in many applications, users require an explanation of the Artificial Intelligences decision. This led to increased demand for Comprehensible Artificial Intelligence. Knowledge Graphs epitomize fertile soil for Comprehensible Artificial Intelligence, due to their ability to display connected data, i.e. knowledge, in a human- as well as machine-readable way. This survey gives a short history to Comprehensible Artificial Intelligence on Knowledge Graphs. Furthermore, we contribute by arguing that the concept Explainable Artificial Intelligence is overloaded and overlapping with Interpretable Machine Learning. By introducing the parent concept Comprehensible Artificial Intelligence, we provide a clear-cut distinction of both concepts while accounting for their similarities. Thus, we provide in this survey a case for Comprehensible Artificial Intelligence on Knowledge Graphs consisting of Interpretable Machine Learning on Knowledge Graphs and Explainable Artificial Intelligence on Knowledge Graphs. This leads to the introduction of a novel taxonomy for Comprehensible Artificial Intelligence on Knowledge Graphs. In addition, a comprehensive overview of the research on Comprehensible Artificial Intelligence on Knowledge Graphs is presented and put into the context of the taxonomy. Finally, research gaps in the field of Comprehensible Artificial Intelligence on Knowledge Graphs are identified for future research.
- [31] arXiv:2404.03502 [ pdf , ps , html , other ]
-
Title: AI and the Problem of Knowledge CollapseComments: 37 pages, 9 figuresSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY)
Abstract: While artificial intelligence has the potential to process vast amounts of data, generate new insights, and unlock greater productivity, its widespread adoption may entail unforeseen consequences. We identify conditions under which AI, by reducing the cost of access to certain modes of knowledge, can paradoxically harm public understanding. While large language models are trained on vast amounts of diverse data, they naturally generate output towards the 'center' of the distribution. This is generally useful, but widespread reliance on recursive AI systems could lead to a process we define as "knowledge collapse", and argue this could harm innovation and the richness of human understanding and culture. However, unlike AI models that cannot choose what data they are trained on, humans may strategically seek out diverse forms of knowledge if they perceive them to be worthwhile. To investigate this, we provide a simple model in which a community of learners or innovators choose to use traditional methods or to rely on a discounted AI-assisted process and identify conditions under which knowledge collapse occurs. In our default model, a 20% discount on AI-generated content generates public beliefs 2.3 times further from the truth than when there is no discount. An empirical approach to measuring the distribution of LLM outputs is provided in theoretical terms and illustrated through a specific example comparing the diversity of outputs across different models and prompting styles. Finally, based on the results, we consider further research directions to counteract such outcomes.
- [32] arXiv:2404.03624 [ pdf , ps , html , other ]
-
Title: Standardizing Knowledge Engineering Practices with a Reference ArchitectureComments: 23 pages, 4 figures, 2 tables, camera-ready version, accepted for Transactions on Graph Data and Knowledge (TGDK)Subjects: Artificial Intelligence (cs.AI) ; Software Engineering (cs.SE)
Abstract: Knowledge engineering is the process of creating and maintaining knowledge-producing systems. Throughout the history of computer science and AI, knowledge engineering workflows have been widely used given the importance of high-quality knowledge for reliable intelligent agents. Meanwhile, the scope of knowledge engineering, as apparent from its target tasks and use cases, has been shifting, together with its paradigms such as expert systems, semantic web, and language modeling. The intended use cases and supported user requirements between these paradigms have not been analyzed globally, as new paradigms often satisfy prior pain points while possibly introducing new ones. The recent abstraction of systemic patterns into a boxology provides an opening for aligning the requirements and use cases of knowledge engineering with the systems, components, and software that can satisfy them best. This paper proposes a vision of harmonizing the best practices in the field of knowledge engineering by leveraging the software engineering methodology of creating reference architectures. We describe how a reference architecture can be iteratively designed and implemented to associate user needs with recurring systemic patterns, building on top of existing knowledge engineering workflows and boxologies. We provide a six-step roadmap that can enable the development of such an architecture, providing an initial design and outcome of the definition of architectural scope, selection of information sources, and analysis. We expect that following through on this vision will lead to well-grounded reference architectures for knowledge engineering, will advance the ongoing initiatives of organizing the neurosymbolic knowledge engineering space, and will build new links to the software architectures and data science communities.
- [33] arXiv:2404.03893 [ pdf , ps , html , other ]
-
Title: KGExplainer: Towards Exploring Connected Subgraph Explanations for Knowledge Graph CompletionTengfei Ma , Xiang song , Wen Tao , Mufei Li , Jiani Zhang , Xiaoqin Pan , Jianxin Lin , Bosheng Song , xiangxiang ZengComments: 13 pages, 7 figures, 11 tables. Under ReviewSubjects: Artificial Intelligence (cs.AI)
Abstract: Knowledge graph completion (KGC) aims to alleviate the inherent incompleteness of knowledge graphs (KGs), which is a critical task for various applications, such as recommendations on the web. Although knowledge graph embedding (KGE) models have demonstrated superior predictive performance on KGC tasks, these models infer missing links in a black-box manner that lacks transparency and accountability, preventing researchers from developing accountable models. Existing KGE-based explanation methods focus on exploring key paths or isolated edges as explanations, which is information-less to reason target prediction. Additionally, the missing ground truth leads to these explanation methods being ineffective in quantitatively evaluating explored explanations. To overcome these limitations, we propose KGExplainer, a model-agnostic method that identifies connected subgraph explanations and distills an evaluator to assess them quantitatively. KGExplainer employs a perturbation-based greedy search algorithm to find key connected subgraphs as explanations within the local structure of target predictions. To evaluate the quality of the explored explanations, KGExplainer distills an evaluator from the target KGE model. By forwarding the explanations to the evaluator, our method can examine the fidelity of them. Extensive experiments on benchmark datasets demonstrate that KGExplainer yields promising improvement and achieves an optimal ratio of 83.3% in human evaluation.
- [34] arXiv:2404.03978 [ pdf , ps , html , other ]
-
Title: Random Walk in Random Permutation Set TheoryComments: 27 pages, 8 figures; references addedSubjects: Artificial Intelligence (cs.AI) ; Information Theory (cs.IT)
Abstract: Random walk is an explainable approach for modeling natural processes at the molecular level. The Random Permutation Set Theory (RPST) serves as a framework for uncertainty reasoning, extending the applicability of Dempster-Shafer Theory. Recent explorations indicate a promising link between RPST and random walk. In this study, we conduct an analysis and construct a random walk model based on the properties of RPST, with Monte Carlo simulations of such random walk. Our findings reveal that the random walk generated through RPST exhibits characteristics similar to those of a Gaussian random walk and can be transformed into a Wiener process through a specific limiting scaling procedure. This investigation establishes a novel connection between RPST and random walk theory, thereby not only expanding the applicability of RPST, but also demonstrating the potential for combining the strengths of both approaches to improve problem-solving abilities.
- [35] arXiv:2404.04106 [ pdf , ps , html , other ]
-
Title: Intervention-Assisted Policy Gradient Methods for Online Stochastic Queuing Network Optimization: Technical ReportComments: 25 pages, 6 FiguresSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Deep Reinforcement Learning (DRL) offers a powerful approach to training neural network control policies for stochastic queuing networks (SQN). However, traditional DRL methods rely on offline simulations or static datasets, limiting their real-world application in SQN control. This work proposes Online Deep Reinforcement Learning-based Controls (ODRLC) as an alternative, where an intelligent agent interacts directly with a real environment and learns an optimal control policy from these online interactions. SQNs present a challenge for ODRLC due to the unbounded nature of the queues within the network resulting in an unbounded state-space. An unbounded state-space is particularly challenging for neural network policies as neural networks are notoriously poor at extrapolating to unseen states. To address this challenge, we propose an intervention-assisted framework that leverages strategic interventions from known stable policies to ensure the queue sizes remain bounded. This framework combines the learning power of neural networks with the guaranteed stability of classical control policies for SQNs. We introduce a method to design these intervention-assisted policies to ensure strong stability of the network. Furthermore, we extend foundational DRL theorems for intervention-assisted policies and develop two practical algorithms specifically for ODRLC of SQNs. Finally, we demonstrate through experiments that our proposed algorithms outperform both classical control approaches and prior ODRLC algorithms.
- [36] arXiv:2404.04108 [ pdf , ps , html , other ]
-
Title: Large language models as oracles for instantiating ontologies with domain-specific knowledgeSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Abstract: Background. Endowing intelligent systems with semantic data commonly requires designing and instantiating ontologies with domain-specific knowledge. Especially in the early phases, those activities are typically performed manually by human experts possibly leveraging on their own experience. The resulting process is therefore time-consuming, error-prone, and often biased by the personal background of the ontology designer. Objective. To mitigate that issue, we propose a novel domain-independent approach to automatically instantiate ontologies with domain-specific knowledge, by leveraging on large language models (LLMs) as oracles. Method. Starting from (i) an initial schema composed by inter-related classes andproperties and (ii) a set of query templates, our method queries the LLM multiple times, and generates instances for both classes and properties from its replies. Thus, the ontology is automatically filled with domain-specific knowledge, compliant to the initial schema. As a result, the ontology is quickly and automatically enriched with manifold instances, which experts may consider to keep, adjust, discard, or complement according to their own needs and expertise. Contribution. We formalise our method in general way and instantiate it over various LLMs, as well as on a concrete case study. We report experiments rooted in the nutritional domain where an ontology of food meals and their ingredients is semi-automatically instantiated from scratch, starting from a categorisation of meals and their relationships. There, we analyse the quality of the generated ontologies and compare ontologies attained by exploiting different LLMs. Finally, we provide a SWOT analysis of the proposed method.
- [37] arXiv:2404.04267 [ pdf , ps , other ]
-
Title: What AIs are not Learning (and Why): Bio-Inspired Foundation Models for RobotsComments: 15 pagesSubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC)
Abstract: It is hard to make robots (including telerobots) that are useful, and harder to make autonomous robots that are robust and general. Current smart robots are created using manual programming, mathematical models, planning frameworks, and reinforcement learning. These methods do not lead to the leaps in performance and generality seen with deep learning, generative AI, and foundation models (FMs). Today's robots do not learn to provide home care, to be nursing assistants, or to do household chores nearly as well as people do. Addressing the aspirational opportunities of robot service applications requires improving how they are created. The high cost of bipedal multi-sensory robots ("bodies") is a significant obstacle for both research and deployment. A deeper issue is that mainstream FMs ("minds") do not support sensing, acting, and learning in context in the real world. They do not lead to robots that communicate well or collaborate. They do not lead to robots that try to learn by experimenting, by asking others, or by imitation learning as appropriate. They do not lead to robots that know enough to be deployed widely in service applications. This paper focuses on what human-compatible service robots need to know. It recommends developing experiential (aka "robotic") FMs for bootstrapping them.
- [38] arXiv:2404.04289 [ pdf , ps , html , other ]
-
Title: Designing for Human-Agent Alignment: Understanding what humans want from their agentsComments: Human-AI Alignment, Human-Agent Alignment, Agents, Generative AI, Large Language ModelsSubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: Our ability to build autonomous agents that leverage Generative AI continues to increase by the day. As builders and users of such agents it is unclear what parameters we need to align on before the agents start performing tasks on our behalf. To discover these parameters, we ran a qualitative empirical research study about designing agents that can negotiate during a fictional yet relatable task of selling a camera online. We found that for an agent to perform the task successfully, humans/users and agents need to align over 6 dimensions: 1) Knowledge Schema Alignment 2) Autonomy and Agency Alignment 3) Operational Alignment and Training 4) Reputational Heuristics Alignment 5) Ethics Alignment and 6) Human Engagement Alignment. These empirical findings expand previous work related to process and specification alignment and the need for values and safety in Human-AI interactions. Subsequently we discuss three design directions for designers who are imagining a world filled with Human-Agent collaborations.
- [39] arXiv:2404.04298 [ pdf , ps , html , other ]
-
Title: SELF-[IN]CORRECT: LLMs Struggle with Refining Self-Generated ResponsesSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Can LLMs continually improve their previous outputs for better results? An affirmative answer would require LLMs to be better at discriminating among previously-generated alternatives, than generating initial responses. We explore the validity of this hypothesis in practice. We first introduce a unified framework that allows us to compare the generative and discriminative capability of any model on any task. Then, in our resulting experimental analysis of several LLMs, we do not observe the performance of those models on discrimination to be reliably better than generation. We hope these findings inform the growing literature on self-improvement AI systems.
- [40] arXiv:2404.04308 [ pdf , ps , other ]
-
Title: Visual Knowledge in the Big Model Era: Retrospect and ProspectSubjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Visual knowledge is a new form of knowledge representation that can encapsulate visual concepts and their relations in a succinct, comprehensive, and interpretable manner, with a deep root in cognitive psychology. As the knowledge about the visual world has been identified as an indispensable component of human cognition and intelligence, visual knowledge is poised to have a pivotal role in establishing machine intelligence. With the recent advance of Artificial Intelligence (AI) techniques, large AI models (or foundation models) have emerged as a potent tool capable of extracting versatile patterns from broad data as implicit knowledge, and abstracting them into an outrageous amount of numeric parameters. To pave the way for creating visual knowledge empowered AI machines in this coming wave, we present a timely review that investigates the origins and development of visual knowledge in the pre-big model era, and accentuates the opportunities and unique role of visual knowledge in the big model era.
- [41] arXiv:2404.04326 [ pdf , ps , html , other ]
-
Title: Hypothesis Generation with Large Language ModelsComments: 26 pages, 6 figures, code link: this https URLSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Effective generation of novel hypotheses is instrumental to scientific progress. So far, researchers have been the main powerhouse behind hypothesis generation by painstaking data analysis and thinking (also known as the Eureka moment). In this paper, we examine the potential of large language models (LLMs) to generate hypotheses. We focus on hypothesis generation based on data (i.e., labeled examples). To enable LLMs to handle arbitrarily long contexts, we generate initial hypotheses from a small number of examples and then update them iteratively to improve the quality of hypotheses. Inspired by multi-armed bandits, we design a reward function to inform the exploitation-exploration tradeoff in the update process. Our algorithm is able to generate hypotheses that enable much better predictive performance than few-shot prompting in classification tasks, improving accuracy by 31.7% on a synthetic dataset and by 13.9%, 3.3% and, 24.9% on three real-world datasets. We also outperform supervised learning by 12.8% and 11.2% on two challenging real-world datasets. Furthermore, we find that the generated hypotheses not only corroborate human-verified theories but also uncover new insights for the tasks.
- [42] arXiv:2404.04344 [ pdf , ps , html , other ]
-
Title: A Repository for Formal ContextsComments: 16 pagesSubjects: Artificial Intelligence (cs.AI) ; Digital Libraries (cs.DL)
Abstract: Data is always at the center of the theoretical development and investigation of the applicability of formal concept analysis. It is therefore not surprising that a large number of data sets are repeatedly used in scholarly articles and software tools, acting as de facto standard data sets. However, the distribution of the data sets poses a problem for the sustainable development of the research field. There is a lack of a central location that provides and describes FCA data sets and links them to already known analysis results. This article analyses the current state of the dissemination of FCA data sets, presents the requirements for a central FCA repository, and highlights the challenges for this.
- [43] arXiv:2404.04436 [ pdf , ps , html , other ]
-
Title: AI Knowledge and Reasoning: Emulating Expert Creativity in Scientific ResearchSubjects: Artificial Intelligence (cs.AI)
Abstract: We investigate whether modern AI can emulate expert creativity in complex scientific endeavors. We introduce novel methodology that utilizes original research articles published after the AI's training cutoff, ensuring no prior exposure, mitigating concerns of rote memorization and prior training. The AI are tasked with redacting findings, predicting outcomes from redacted research, and assessing prediction accuracy against reported results. Analysis on 589 published studies in four leading psychology journals over a 28-month period, showcase the AI's proficiency in understanding specialized research, deductive reasoning, and evaluating evidentiary alignment--cognitive hallmarks of human subject matter expertise and creativity. These findings suggest the potential of general-purpose AI to transform academia, with roles requiring knowledge-based creativity become increasingly susceptible to technological substitution.
- [44] arXiv:2404.04442 [ pdf , ps , html , other ]
-
Title: Exploring Autonomous Agents through the Lens of Large Language Models: A ReviewComments: 47 pages, 5 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) are transforming artificial intelligence, enabling autonomous agents to perform diverse tasks across various domains. These agents, proficient in human-like text comprehension and generation, have the potential to revolutionize sectors from customer service to healthcare. However, they face challenges such as multimodality, human value alignment, hallucinations, and evaluation. Techniques like prompting, reasoning, tool utilization, and in-context learning are being explored to enhance their capabilities. Evaluation platforms like AgentBench, WebArena, and ToolLLM provide robust methods for assessing these agents in complex scenarios. These advancements are leading to the development of more resilient and capable autonomous agents, anticipated to become integral in our digital lives, assisting in tasks from email responses to disease diagnosis. The future of AI, with LLMs at the forefront, is promising.
- [45] arXiv:2404.04538 [ pdf , ps , html , other ]
-
Title: Soft-Prompting with Graph-of-Thought for Multi-modal Representation LearningComments: This paper is accepted to LREC-COLING 2024Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: The chain-of-thought technique has been received well in multi-modal tasks. It is a step-by-step linear reasoning process that adjusts the length of the chain to improve the performance of generated prompts. However, human thought processes are predominantly non-linear, as they encompass multiple aspects simultaneously and employ dynamic adjustment and updating mechanisms. Therefore, we propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. The proposed AGoT models the human thought process not only as a chain but also models each step as a reasoning aggregation graph to cope with the overlooked multiple aspects of thinking in single-step reasoning. This turns the entire reasoning process into prompt aggregation and prompt flow operations. Experiments show that our multi-modal model enhanced with AGoT soft-prompting achieves good results in several tasks such as text-image retrieval, visual question answering, and image recognition. In addition, we demonstrate that it has good domain generalization performance due to better reasoning.
- [46] arXiv:2404.04540 [ pdf , ps , html , other ]
-
Title: The Case for Developing a Foundation Model for Planning-like Tasks from ScratchSubjects: Artificial Intelligence (cs.AI)
Abstract: Foundation Models (FMs) have revolutionized many areas of computing, including Automated Planning and Scheduling (APS). For example, a recent study found them useful for planning problems: plan generation, language translation, model construction, multi-agent planning, interactive planning, heuristics optimization, tool integration, and brain-inspired planning. Besides APS, there are many seemingly related tasks involving the generation of a series of actions with varying guarantees of their executability to achieve intended goals, which we collectively call planning-like (PL) tasks like business processes, programs, workflows, and guidelines, where researchers have considered using FMs. However, previous works have primarily focused on pre-trained, off-the-shelf FMs and optionally fine-tuned them. This paper discusses the need for a comprehensive FM for PL tasks from scratch and explores its design considerations. We argue that such an FM will open new and efficient avenues for PL problem-solving, just like LLMs are creating for APS.
- [47] arXiv:2404.04619 [ pdf , ps , html , other ]
-
Title: Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single ModelZhonghan Zhao , Ke Ma , Wenhao Chai , Xuan Wang , Kewei Chen , Dongxu Guo , Yanting Zhang , Hongwei Wang , Gaoang WangComments: arXiv admin note: text overlap with arXiv:2403.08282Subjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV)
Abstract: With the power of large language models (LLMs), open-ended embodied agents can flexibly understand human instructions, generate interpretable guidance strategies, and output executable actions. Nowadays, Multi-modal Language Models~(MLMs) integrate multi-modal signals into LLMs, further bringing richer perception to entity agents and allowing embodied agents to perceive world-understanding tasks more delicately. However, existing works: 1) operate independently by agents, each containing multiple LLMs, from perception to action, resulting in gaps between complex tasks and execution; 2) train MLMs on static data, struggling with dynamics in open-ended scenarios; 3) input prior knowledge directly as prompts, suppressing application flexibility. We propose STEVE-2, a hierarchical knowledge distillation framework for open-ended embodied tasks, characterized by 1) a hierarchical system for multi-granular task division, 2) a mirrored distillation method for parallel simulation data, and 3) an extra expert model for bringing additional knowledge into parallel simulation. After distillation, embodied agents can complete complex, open-ended tasks without additional expert guidance, utilizing the performance and knowledge of a versatile MLM. Extensive evaluations on navigation and creation tasks highlight the superior performance of STEVE-2 in open-ended tasks, with $1.4 \times$ - $7.3 \times$ in performance.
- [48] arXiv:2404.04667 [ pdf , ps , html , other ]
-
Title: Autonomous Artificial Intelligence Agents for Clinical Decision Making in OncologyDyke Ferber , Omar S. M. El Nahhas , Georg Wölflein , Isabella C. Wiest , Jan Clusmann , Marie-Elisabeth Leßman , Sebastian Foersch , Jacqueline Lammert , Maximilian Tschochohei , Dirk Jäger , Manuel Salto-Tellez , Nikolaus Schultz , Daniel Truhn , Jakob Nikolas KatherComments: 91 pages, 2 FiguresSubjects: Artificial Intelligence (cs.AI) ; Tissues and Organs (q-bio.TO)
Abstract: Multimodal artificial intelligence (AI) systems have the potential to enhance clinical decision-making by interpreting various types of medical data. However, the effectiveness of these models across all medical fields is uncertain. Each discipline presents unique challenges that need to be addressed for optimal performance. This complexity is further increased when attempting to integrate different fields into a single model. Here, we introduce an alternative approach to multimodal medical AI that utilizes the generalist capabilities of a large language model (LLM) as a central reasoning engine. This engine autonomously coordinates and deploys a set of specialized medical AI tools. These tools include text, radiology and histopathology image interpretation, genomic data processing, web searches, and document retrieval from medical guidelines. We validate our system across a series of clinical oncology scenarios that closely resemble typical patient care workflows. We show that the system has a high capability in employing appropriate tools (97%), drawing correct conclusions (93.6%), and providing complete (94%), and helpful (89.2%) recommendations for individual patient cases while consistently referencing relevant literature (82.5%) upon instruction. This work provides evidence that LLMs can effectively plan and execute domain-specific models to retrieve or synthesize new information when used as autonomous agents. This enables them to function as specialist, patient-tailored clinical assistants. It also simplifies regulatory compliance by allowing each component tool to be individually validated and approved. We believe, that our work can serve as a proof-of-concept for more advanced LLM-agents in the medical domain.
- [49] arXiv:2404.04735 [ pdf , ps , html , other ]
-
Title: MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical ProblemsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Abstract: Recent advancements in large language models, such as GPT-4, have demonstrated remarkable capabilities in processing standard queries. Despite these advancements, their performance substantially declines in \textbf{advanced mathematical problems requiring complex, multi-step logical reasoning}. To enhance their inferential capabilities, current research has delved into \textit{prompting engineering}, exemplified by methodologies such as the Tree of Thought and Graph of Thought. Nonetheless, these existing approaches encounter two significant limitations. Firstly, their effectiveness in tackling complex mathematical problems is somewhat constrained. Secondly, the necessity to design distinct prompts for individual problems hampers their generalizability. In response to these limitations, this paper introduces the \textit{Multi-Agent System for conditional Mining} (\textbf{MACM}) prompting method. It not only resolves intricate mathematical problems but also demonstrates strong generalization capabilities across various mathematical contexts. With the assistance of MACM, the accuracy of GPT-4 Turbo on the most challenging level five mathematical problems in the MATH dataset increase from $\mathbf{54.68\%} \text{ to } \mathbf{76.73\%}$. The code is available in \url{ this https URL }.
- [50] arXiv:2404.04752 [ pdf , ps , html , other ]
-
Title: Challenges Faced by Large Language Models in Solving Multi-Agent FlockingSubjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA)
Abstract: Flocking is a behavior where multiple agents in a system attempt to stay close to each other while avoiding collision and maintaining a desired formation. This is observed in the natural world and has applications in robotics, including natural disaster search and rescue, wild animal tracking, and perimeter surveillance and patrol. Recently, large language models (LLMs) have displayed an impressive ability to solve various collaboration tasks as individual decision-makers. Solving multi-agent flocking with LLMs would demonstrate their usefulness in situations requiring spatial and decentralized decision-making. Yet, when LLM-powered agents are tasked with implementing multi-agent flocking, they fall short of the desired behavior. After extensive testing, we find that agents with LLMs as individual decision-makers typically opt to converge on the average of their initial positions or diverge from each other. After breaking the problem down, we discover that LLMs cannot understand maintaining a shape or keeping a distance in a meaningful way. Solving multi-agent flocking with LLMs would enhance their ability to understand collaborative spatial reasoning and lay a foundation for addressing more complex multi-agent tasks. This paper discusses the challenges LLMs face in multi-agent flocking and suggests areas for future improvement and research.
- [51] arXiv:2404.04818 [ pdf , ps , html , other ]
-
Title: DWE+: Dual-Way Matching Enhanced Framework for Multimodal Entity LinkingShezheng Song , Shasha Li , Shan Zhao , Xiaopeng Li , Chengyu Wang , Jie Yu , Jun Ma , Tianwei Yan , Bin Ji , Xiaoguang MaoComments: under review on TOIS. arXiv admin note: substantial text overlap with arXiv:2312.11816Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Multimodal entity linking (MEL) aims to utilize multimodal information (usually textual and visual information) to link ambiguous mentions to unambiguous entities in knowledge base. Current methods facing main issues: (1)treating the entire image as input may contain redundant information. (2)the insufficient utilization of entity-related information, such as attributes in images. (3)semantic inconsistency between the entity in knowledge base and its representation. To this end, we propose DWE+ for multimodal entity linking. DWE+ could capture finer semantics and dynamically maintain semantic consistency with entities. This is achieved by three aspects: (a)we introduce a method for extracting fine-grained image features by partitioning the image into multiple local objects. Then, hierarchical contrastive learning is used to further align semantics between coarse-grained information(text and image) and fine-grained (mention and visual objects). (b)we explore ways to extract visual attributes from images to enhance fusion feature such as facial features and identity. (c)we leverage Wikipedia and ChatGPT to capture the entity representation, achieving semantic enrichment from both static and dynamic perspectives, which better reflects the real-world entity semantics. Experiments on Wikimel, Richpedia, and Wikidiverse datasets demonstrate the effectiveness of DWE+ in improving MEL performance. Specifically, we optimize these datasets and achieve state-of-the-art performance on the enhanced datasets. The code and enhanced datasets are released on this https URL
- [52] arXiv:2404.04902 [ pdf , ps , html , other ]
-
Title: AI2Apps: A Visual IDE for Building LLM-based AI Agent ApplicationsSubjects: Artificial Intelligence (cs.AI) ; Software Engineering (cs.SE)
Abstract: We introduce AI2Apps, a Visual Integrated Development Environment (Visual IDE) with full-cycle capabilities that accelerates developers to build deployable LLM-based AI agent Applications. This Visual IDE prioritizes both the Integrity of its development tools and the Visuality of its components, ensuring a smooth and efficient building experience.On one hand, AI2Apps integrates a comprehensive development toolkit ranging from a prototyping canvas and AI-assisted code editor to agent debugger, management system, and deployment tools all within a web-based graphical user interface. On the other hand, AI2Apps visualizes reusable front-end and back-end code as intuitive drag-and-drop components. Furthermore, a plugin system named AI2Apps Extension (AAE) is designed for Extensibility, showcasing how a new plugin with 20 components enables web agent to mimic human-like browsing behavior. Our case study demonstrates substantial efficiency improvements, with AI2Apps reducing token consumption and API calls when debugging a specific sophisticated multimodal agent by approximately 90% and 80%, respectively. The AI2Apps, including an online demo, open-source code, and a screencast video, is now publicly accessible.
- [53] arXiv:2404.05012 [ pdf , ps , html , other ]
-
Title: Towards Reliable and Empathetic Depression-Diagnosis-Oriented ChatsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Chatbots can serve as a viable tool for preliminary depression diagnosis via interactive conversations with potential patients. Nevertheless, the blend of task-oriented and chit-chat in diagnosis-related dialogues necessitates professional expertise and empathy. Such unique requirements challenge traditional dialogue frameworks geared towards single optimization goals. To address this, we propose an innovative ontology definition and generation framework tailored explicitly for depression diagnosis dialogues, combining the reliability of task-oriented conversations with the appeal of empathy-related chit-chat. We further apply the framework to D$^4$, the only existing public dialogue dataset on depression diagnosis-oriented chats. Exhaustive experimental results indicate significant improvements in task completion and emotional support generation in depression diagnosis, fostering a more comprehensive approach to task-oriented chat dialogue system development and its applications in digital mental health.
- [54] arXiv:2404.05074 [ pdf , ps , html , other ]
-
Title: On the Uniqueness of Solution for the Bellman Equation of LTL ObjectivesComments: Accepted for the 2024 Learning for Dynamics and Control Conference (L4DC)Subjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO)
Abstract: Surrogate rewards for linear temporal logic (LTL) objectives are commonly utilized in planning problems for LTL objectives. In a widely-adopted surrogate reward approach, two discount factors are used to ensure that the expected return approximates the satisfaction probability of the LTL objective. The expected return then can be estimated by methods using the Bellman updates such as reinforcement learning. However, the uniqueness of the solution to the Bellman equation with two discount factors has not been explicitly discussed. We demonstrate with an example that when one of the discount factors is set to one, as allowed in many previous works, the Bellman equation may have multiple solutions, leading to inaccurate evaluation of the expected return. We then propose a condition for the Bellman equation to have the expected return as the unique solution, requiring the solutions for states inside a rejecting bottom strongly connected component (BSCC) to be 0. We prove this condition is sufficient by showing that the solutions for the states with discounting can be separated from those for the states without discounting under this condition
- [55] arXiv:2404.05223 [ pdf , ps , html , other ]
-
Title: ITA-ECBS: A Bounded-Suboptimal Algorithm for the Combined Target-Assignment and Path-Finding ProblemSubjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO)
Abstract: Multi-Agent Path Finding (MAPF), i.e., finding collision-free paths for multiple robots, plays a critical role in many applications. Sometimes, assigning a target to each agent also presents a challenge. The Combined Target-Assignment and Path-Finding (TAPF) problem, a variant of MAPF, requires one to simultaneously assign targets to agents and plan collision-free paths for agents. Several algorithms, including CBM, CBS-TA, and ITA-CBS, optimally solve the TAPF problem, with ITA-CBS being the leading algorithm for minimizing flowtime. However, the only existing bounded-suboptimal algorithm ECBS-TA is derived from CBS-TA rather than ITA-CBS. So, it faces the same issues as CBS-TA, such as searching through multiple constraint trees and spending too much time on finding the next-best target assignment. We introduce ITA-ECBS, the first bounded-suboptimal variant of ITA-CBS. Transforming ITA-CBS to its bounded-suboptimal variant is challenging because different constraint tree nodes can have different assignments of targets to agents. ITA-ECBS uses focal search to achieve efficiency and determines target assignments based on a new lower bound matrix. We show that it runs faster than ECBS-TA in 87.42% of 54,033 test cases.
- [56] arXiv:2404.05224 [ pdf , ps , html , other ]
-
Title: Iof-maint -- Modular maintenance ontologySubjects: Artificial Intelligence (cs.AI) ; Logic in Computer Science (cs.LO)
Abstract: In this paper we present a publicly-available maintenance ontology (Iof-maint). Iof-maint is a modular ontology aligned with the Industrial Ontology Foundry Core (IOF Core) and contains 20 classes and 2 relations. It provides a set of maintenance-specific terms used in a wide variety of practical data-driven use cases. Iof-maint supports OWL DL reasoning, is documented, and is actively maintained on GitHub. In this paper, we describe the evolution of the Iof-maint reference ontology based on the extraction of common concepts identified in a number of application ontologies working with industry maintenance work order, procedure and failure mode data.
- [57] arXiv:2404.05235 [ pdf , ps , html , other ]
-
Title: Novelty Heuristics, Multi-Queue Search, and Portfolios for Numeric PlanningComments: Extended version of SoCS 2024 paperSubjects: Artificial Intelligence (cs.AI)
Abstract: Heuristic search is a powerful approach for solving planning problems and numeric planning is no exception. In this paper, we boost the performance of heuristic search for numeric planning with various powerful techniques orthogonal to improving heuristic informedness: numeric novelty heuristics, the Manhattan distance heuristic, and exploring the use of multi-queue search and portfolios for combining heuristics.
- [58] arXiv:2404.05259 [ pdf , ps , other ]
-
Title: Cellular automata, many-valued logic, and deep neural networksSubjects: Artificial Intelligence (cs.AI)
Abstract: We develop a theory characterizing the fundamental capability of deep neural networks to learn, from evolution traces, the logical rules governing the behavior of cellular automata (CA). This is accomplished by first establishing a novel connection between CA and Lukasiewicz propositional logic. While binary CA have been known for decades to essentially perform operations in Boolean logic, no such relationship exists for general CA. We demonstrate that many-valued (MV) logic, specifically Lukasiewicz propositional logic, constitutes a suitable language for characterizing general CA as logical machines. This is done by interpolating CA transition functions to continuous piecewise linear functions, which, by virtue of the McNaughton theorem, yield formulae in MV logic characterizing the CA. Recognizing that deep rectified linear unit (ReLU) networks realize continuous piecewise linear functions, it follows that these formulae are naturally extracted from CA evolution traces by deep ReLU networks. A corresponding algorithm together with a software implementation is provided. Finally, we show that the dynamical behavior of CA can be realized by recurrent neural networks.
- [59] arXiv:2404.05272 [ pdf , ps , html , other ]
-
Title: Constructing Data Transaction Chains Based on Opportunity Cost ExplorationSubjects: Artificial Intelligence (cs.AI)
Abstract: Data trading is increasingly gaining attention. However, the inherent replicability and privacy concerns of data make it challenging to directly apply traditional trading theories to data markets. This paper compares data trading markets with traditional ones, focusing particularly on how the replicability and privacy of data impact data markets. We discuss how data's replicability fundamentally alters the concept of opportunity cost in traditional microeconomics within the context of data markets. Additionally, we explore how to leverage this change to maximize benefits without compromising data privacy. This paper outlines the constraints for data circulation within the privacy domain chain and presents a model that maximizes data's value under these constraints. Specific application scenarios are provided, and experiments demonstrate the solvability of this model.
- [60] arXiv:2404.05424 [ pdf , ps , other ]
-
Title: What Are the Odds? Improving the foundations of Statistical Model CheckingSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract: Markov decision processes (MDPs) are a fundamental model for decision making under uncertainty. They exhibit non-deterministic choice as well as probabilistic uncertainty. Traditionally, verification algorithms assume exact knowledge of the probabilities that govern the behaviour of an MDP. As this assumption is often unrealistic in practice, statistical model checking (SMC) was developed in the past two decades. It allows to analyse MDPs with unknown transition probabilities and provide probably approximately correct (PAC) guarantees on the result. Model-based SMC algorithms sample the MDP and build a model of it by estimating all transition probabilities, essentially for every transition answering the question: ``What are the odds?'' However, so far the statistical methods employed by the state of the art SMC algorithms are quite naive. Our contribution are several fundamental improvements to those methods: On the one hand, we survey statistics literature for better concentration inequalities; on the other hand, we propose specialised approaches that exploit our knowledge of the MDP. Our improvements are generally applicable to many kinds of problem statements because they are largely independent of the setting. Moreover, our experimental evaluation shows that they lead to significant gains, reducing the number of samples that the SMC algorithm has to collect by up to two orders of magnitude.
- [61] arXiv:2404.05440 [ pdf , ps , other ]
-
Title: Tree Search-Based Policy Optimization under Stochastic Execution DelayComments: Published in ICLR 2024Subjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: The standard formulation of Markov decision processes (MDPs) assumes that the agent's decisions are executed immediately. However, in numerous realistic applications such as robotics or healthcare, actions are performed with a delay whose value can even be stochastic. In this work, we introduce stochastic delayed execution MDPs, a new formalism addressing random delays without resorting to state augmentation. We show that given observed delay values, it is sufficient to perform a policy search in the class of Markov policies in order to reach optimal performance, thus extending the deterministic fixed delay case. Armed with this insight, we devise DEZ, a model-based algorithm that optimizes over the class of Markov policies. DEZ leverages Monte-Carlo tree search similar to its non-delayed variant EfficientZero to accurately infer future states from the action queue. Thus, it handles delayed execution while preserving the sample efficiency of EfficientZero. Through a series of experiments on the Atari suite, we demonstrate that although the previous baseline outperforms the naive method in scenarios with constant delay, it underperforms in the face of stochastic delays. In contrast, our approach significantly outperforms the baselines, for both constant and stochastic delays. The code is available at this http URL .
- [62] arXiv:2404.05569 [ pdf , ps , html , other ]
-
Title: 360{\deg}REA: Towards A Reusable Experience Accumulation with 360{\deg} Assessment for Multi-Agent SystemShen Gao , Hao Li , Zhengliang Shi , Chengrui Huang , Quan Tu , Zhiliang Tian , Minlie Huang , Shuo ShangSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Abstract: Large language model agents have demonstrated remarkable advancements across various complex tasks. Recent works focus on optimizing the agent team or employing self-reflection to iteratively solve complex tasks. Since these agents are all based on the same LLM, only conducting self-evaluation or removing underperforming agents does not substantively enhance the capability of the agents. We argue that a comprehensive evaluation and accumulating experience from evaluation feedback is an effective approach to improving system performance. In this paper, we propose Reusable Experience Accumulation with 360° Assessment (360°REA), a hierarchical multi-agent framework inspired by corporate organizational practices. The framework employs a novel 360° performance assessment method for multi-perspective performance evaluation with fine-grained assessment. To enhance the capability of agents in addressing complex tasks, we introduce dual-level experience pool for agents to accumulate experience through fine-grained assessment. Extensive experiments on complex task datasets demonstrate the effectiveness of 360°REA.
- [63] arXiv:2404.05735 [ pdf , ps , html , other ]
-
Title: A Python Framework for Neutrosophic Sets and MappingsComments: 38 PAGESJournal-ref: Neutrosophic Sets and Systems 65, 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: In this paper we present an open source framework developed in Python and consisting of three distinct classes designed to manipulate in a simple and intuitive way both symbolic representations of neutrosophic sets over universes of various types as well as mappings between them. The capabilities offered by this framework extend and generalize previous attempts to provide software solutions to the manipulation of neutrosophic sets such as those proposed by Salama et al., Saranya et al., El-Ghareeb, Topal et al. and Sleem. The code is described in detail and many examples and use cases are also provided.
- [64] arXiv:2404.05893 [ pdf , ps , other ]
-
Title: Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language ModelsSowmya S. Sundaram , Benjamin Solomon , Avani Khatri , Anisha Laumas , Purvesh Khatri , Mark A. MusenSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Information Retrieval (cs.IR)
Abstract: Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets. This paper investigates the potential of large language models (LLMs), specifically GPT-4, to improve adherence to metadata standards. We conducted experiments on 200 random data records describing human samples relating to lung cancer from the NCBI BioSample repository, evaluating GPT-4's ability to suggest edits for adherence to metadata standards. We computed the adherence accuracy of field name-field value pairs through a peer review process, and we observed a marginal average improvement in adherence to the standard data dictionary from 79% to 80% (p<0.01). We then prompted GPT-4 with domain information in the form of the textual descriptions of CEDAR templates and recorded a significant improvement to 97% from 79% (p<0.01). These results indicate that, while LLMs may not be able to correct legacy metadata to ensure satisfactory adherence to standards when unaided, they do show promise for use in automated metadata curation when integrated with a structured knowledge base.
- [65] arXiv:2404.06325 [ pdf , ps , html , other ]
-
Title: Automatically Learning HTN Methods from LandmarksComments: This work has been submitted to FLAIRS-24Subjects: Artificial Intelligence (cs.AI)
Abstract: Hierarchical Task Network (HTN) planning usually requires a domain engineer to provide manual input about how to decompose a planning problem. Even HTN-MAKER, a well-known method-learning algorithm, requires a domain engineer to annotate the tasks with information about what to learn. We introduce CURRICULAMA, an HTN method learning algorithm that completely automates the learning process. It uses landmark analysis to compose annotated tasks and leverages curriculum learning to order the learning of methods from simpler to more complex. This eliminates the need for manual input, resolving a core issue with HTN-MAKER. We prove CURRICULAMA's soundness, and show experimentally that it has a substantially similar convergence rate in learning a complete set of methods to HTN-MAKER.
- [66] arXiv:2404.06345 [ pdf , ps , html , other ]
-
Title: AgentsCoDriver: Large Language Model Empowered Collaborative Driving with Lifelong LearningSubjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO)
Abstract: Connected and autonomous driving is developing rapidly in recent years. However, current autonomous driving systems, which are primarily based on data-driven approaches, exhibit deficiencies in interpretability, generalization, and continuing learning capabilities. In addition, the single-vehicle autonomous driving systems lack of the ability of collaboration and negotiation with other vehicles, which is crucial for the safety and efficiency of autonomous driving systems. In order to address these issues, we leverage large language models (LLMs) to develop a novel framework, AgentsCoDriver, to enable multiple vehicles to conduct collaborative driving. AgentsCoDriver consists of five modules: observation module, reasoning engine, cognitive memory module, reinforcement reflection module, and communication module. It can accumulate knowledge, lessons, and experiences over time by continuously interacting with the environment, thereby making itself capable of lifelong learning. In addition, by leveraging the communication module, different agents can exchange information and realize negotiation and collaboration in complex traffic environments. Extensive experiments are conducted and show the superiority of AgentsCoDriver.
- [67] arXiv:2404.06370 [ pdf , ps , other ]
-
Title: Enhancing Decision Analysis with a Large Language Model: pyDecision a Comprehensive Library of MCDA Methods in PythonValdecy Pereira , Marcio Pereira Basilio , Carlos Henrique Tarjano SantosCarlos Henrique Tarjano SantosComments: 23 pages, 2 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Purpose: Multicriteria decision analysis (MCDA) has become increasingly essential for decision-making in complex environments. In response to this need, the pyDecision library, implemented in Python and available at this https URL , has been developed to provide a comprehensive and accessible collection of MCDA methods. Methods: The pyDecision offers 70 MCDA methods, including AHP, TOPSIS, and the PROMETHEE and ELECTRE families. Beyond offering a vast range of techniques, the library provides visualization tools for more intuitive results interpretation. In addition to these features, pyDecision has integrated ChatGPT, an advanced Large Language Model, where decision-makers can use ChatGPT to discuss and compare the outcomes of different methods, providing a more interactive and intuitive understanding of the solutions. Findings: Large Language Models are undeniably potent but can sometimes be a double-edged sword. Its answers may be misleading without rigorous verification of its outputs, especially for researchers lacking deep domain expertise. It's imperative to approach its insights with a discerning eye and a solid foundation in the relevant field. Originality: With the integration of MCDA methods and ChatGPT, pyDecision is a significant contribution to the scientific community, as it is an invaluable resource for researchers, practitioners, and decision-makers navigating complex decision-making problems and seeking the most appropriate solutions based on MCDA methods.
- [68] arXiv:2404.06405 [ pdf , ps , html , other ]
-
Title: Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO GeometryComments: Work in Progress. Released for wider feedbackSubjects: Artificial Intelligence (cs.AI) ; Computational Geometry (cs.CG); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Proving geometric theorems constitutes a hallmark of visual reasoning combining both intuitive and logical skills. Therefore, automated theorem proving of Olympiad-level geometry problems is considered a notable milestone in human-level automated reasoning. The introduction of AlphaGeometry, a neuro-symbolic model trained with 100 million synthetic samples, marked a major breakthrough. It solved 25 of 30 International Mathematical Olympiad (IMO) problems whereas the reported baseline based on Wu's method solved only ten. In this note, we revisit the IMO-AG-30 Challenge introduced with AlphaGeometry, and find that Wu's method is surprisingly strong. Wu's method alone can solve 15 problems, and some of them are not solved by any of the other methods. This leads to two key findings: (i) Combining Wu's method with the classic synthetic methods of deductive databases and angle, ratio, and distance chasing solves 21 out of 30 methods by just using a CPU-only laptop with a time limit of 5 minutes per problem. Essentially, this classic method solves just 4 problems less than AlphaGeometry and establishes the first fully symbolic baseline strong enough to rival the performance of an IMO silver medalist. (ii) Wu's method even solves 2 of the 5 problems that AlphaGeometry failed to solve. Thus, by combining AlphaGeometry with Wu's method we set a new state-of-the-art for automated theorem proving on IMO-AG-30, solving 27 out of 30 problems, the first AI method which outperforms an IMO gold medalist.
- [69] arXiv:2404.06411 [ pdf , ps , html , other ]
-
Title: AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM AgentsLuca Gioacchini , Giuseppe Siracusano , Davide Sanvito , Kiril Gashteovski , David Friede , Roberto Bifulco , Carolin LawrenceComments: Accepted at the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2024)Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: The advances made by Large Language Models (LLMs) have led to the pursuit of LLM agents that can solve intricate, multi-step reasoning tasks. As with any research pursuit, benchmarking and evaluation are key corner stones to efficient and reliable progress. However, existing benchmarks are often narrow and simply compute overall task success. To face these issues, we propose AgentQuest -- a framework where (i) both benchmarks and metrics are modular and easily extensible through well documented and easy-to-use APIs; (ii) we offer two new evaluation metrics that can reliably track LLM agent progress while solving a task. We exemplify the utility of the metrics on two use cases wherein we identify common failure points and refine the agent architecture to obtain a significant performance increase. Together with the research community, we hope to extend AgentQuest further and therefore we make it available under this https URL .
- [70] arXiv:2404.06474 [ pdf , ps , html , other ]
-
Title: Autonomous Evaluation and Refinement of Digital AgentsComments: Code at this https URLSubjects: Artificial Intelligence (cs.AI)
Abstract: We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control. We experiment with multiple evaluation models that trade off between inference cost, modularity of design, and accuracy. We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4 and 92.9% agreement with oracle evaluation metrics. Finally, we use these evaluators to improve the performance of existing agents via fine-tuning and inference-time guidance. Without any additional supervision, we improve state-of-the-art performance by 29% on the popular benchmark WebArena, and achieve a 75% relative improvement in a challenging domain transfer scenario.
- [71] arXiv:2404.06571 [ pdf , ps , other ]
-
Title: Building A Knowledge Graph to Enrich ChatGPT Responses in Manufacturing Service DiscoverySubjects: Artificial Intelligence (cs.AI)
Abstract: Sourcing and identification of new manufacturing partners is crucial for manufacturing system integrators to enhance agility and reduce risk through supply chain diversification in the global economy. The advent of advanced large language models has captured significant interest, due to their ability to generate comprehensive and articulate responses across a wide range of knowledge domains. However, the system often falls short in accuracy and completeness when responding to domain-specific inquiries, particularly in areas like manufacturing service discovery. This research explores the potential of leveraging Knowledge Graphs in conjunction with ChatGPT to streamline the process for prospective clients in identifying small manufacturing enterprises. In this study, we propose a method that integrates bottom-up ontology with advanced machine learning models to develop a Manufacturing Service Knowledge Graph from an array of structured and unstructured data sources, including the digital footprints of small-scale manufacturers throughout North America. The Knowledge Graph and the learned graph embedding vectors are leveraged to tackle intricate queries within the digital supply chain network, responding with enhanced reliability and greater interpretability. The approach highlighted is scalable to millions of entities that can be distributed to form a global Manufacturing Service Knowledge Network Graph that can potentially interconnect multiple types of Knowledge Graphs that span industry sectors, geopolitical boundaries, and business domains. The dataset developed for this study, now publicly accessible, encompasses more than 13,000 manufacturers' weblinks, manufacturing services, certifications, and location entity types.
- [72] arXiv:2404.06609 [ pdf , ps , html , other ]
-
Title: GOAT-Bench: A Benchmark for Multi-Modal Lifelong NavigationMukul Khanna , Ram Ramrakhya , Gunjan Chhablani , Sriram Yenamandra , Theophile Gervet , Matthew Chang , Zsolt Kira , Devendra Singh Chaplot , Dhruv Batra , Roozbeh MottaghiSubjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO)
Abstract: The Embodied AI community has made significant strides in visual navigation tasks, exploring targets from 3D coordinates, objects, language descriptions, and images. However, these navigation models often handle only a single input modality as the target. With the progress achieved so far, it is time to move towards universal navigation models capable of handling various goal types, enabling more effective user interaction with robots. To facilitate this goal, we propose GOAT-Bench, a benchmark for the universal navigation task referred to as GO to AnyThing (GOAT). In this task, the agent is directed to navigate to a sequence of targets specified by the category name, language description, or image in an open-vocabulary fashion. We benchmark monolithic RL and modular methods on the GOAT task, analyzing their performance across modalities, the role of explicit and implicit scene memories, their robustness to noise in goal specifications, and the impact of memory in lifelong scenarios.
- [73] arXiv:2404.06681 [ pdf , ps , html , other ]
-
Title: Causal Unit Selection using Tractable Arithmetic CircuitsSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Methodology (stat.ME)
Abstract: The unit selection problem aims to find objects, called units, that optimize a causal objective function which describes the objects' behavior in a causal context (e.g., selecting customers who are about to churn but would most likely change their mind if encouraged). While early studies focused mainly on bounding a specific class of counterfactual objective functions using data, more recent work allows one to find optimal units exactly by reducing the causal objective to a classical objective on a meta-model, and then applying a variant of the classical Variable Elimination (VE) algorithm to the meta-model -- assuming a fully specified causal model is available. In practice, however, finding optimal units using this approach can be very expensive because the used VE algorithm must be exponential in the constrained treewidth of the meta-model, which is larger and denser than the original model. We address this computational challenge by introducing a new approach for unit selection that is not necessarily limited by the constrained treewidth. This is done through compiling the meta-model into a special class of tractable arithmetic circuits that allows the computation of optimal units in time linear in the circuit size. We finally present empirical results on random causal models that show order-of-magnitude speedups based on the proposed method for solving unit selection.
- [74] arXiv:2404.06946 [ pdf , ps , html , other ]
-
Title: A Survey on the Integration of Generative AI for Critical Thinking in Mobile NetworksAthanasios Karapantelakis , Alexandros Nikou , Ajay Kattepur , Jean Martins , Leonid Mokrushin , Swarup Kumar Mohalik , Marin Orlic , Aneta Vulgarakis FeljanComments: 14 pages, 3 figures, 4 tablesSubjects: Artificial Intelligence (cs.AI)
Abstract: In the near future, mobile networks are expected to broaden their services and coverage to accommodate a larger user base and diverse user needs. Thus, they will increasingly rely on artificial intelligence (AI) to manage network operation and control costs, undertaking complex decision-making roles. This shift will necessitate the application of techniques that incorporate critical thinking abilities, including reasoning and planning. Symbolic AI techniques already facilitate critical thinking based on existing knowledge. Yet, their use in telecommunications is hindered by the high cost of mostly manual curation of this knowledge and high computational complexity of reasoning tasks. At the same time, there is a spurt of innovations in industries such as telecommunications due to Generative AI (GenAI) technologies, operating independently of human-curated knowledge. However, their capacity for critical thinking remains uncertain. This paper aims to address this gap by examining the current status of GenAI algorithms with critical thinking capabilities and investigating their potential applications in telecom networks. Specifically, the aim of this study is to offer an introduction to the potential utilization of GenAI for critical thinking techniques in mobile networks, while also establishing a foundation for future research.
- [75] arXiv:2404.07139 [ pdf , ps , html , other ]
-
Title: Towards a Game-theoretic Understanding of Explanation-based Membership Inference AttacksComments: arXiv admin note: text overlap with arXiv:2202.02659Subjects: Artificial Intelligence (cs.AI) ; Computer Science and Game Theory (cs.GT)
Abstract: Model explanations improve the transparency of black-box machine learning (ML) models and their decisions; however, they can also be exploited to carry out privacy threats such as membership inference attacks (MIA). Existing works have only analyzed MIA in a single "what if" interaction scenario between an adversary and the target ML model; thus, it does not discern the factors impacting the capabilities of an adversary in launching MIA in repeated interaction settings. Additionally, these works rely on assumptions about the adversary's knowledge of the target model's structure and, thus, do not guarantee the optimality of the predefined threshold required to distinguish the members from non-members. In this paper, we delve into the domain of explanation-based threshold attacks, where the adversary endeavors to carry out MIA attacks by leveraging the variance of explanations through iterative interactions with the system comprising of the target ML model and its corresponding explanation method. We model such interactions by employing a continuous-time stochastic signaling game framework. In our framework, an adversary plays a stopping game, interacting with the system (having imperfect information about the type of an adversary, i.e., honest or malicious) to obtain explanation variance information and computing an optimal threshold to determine the membership of a datapoint accurately. First, we propose a sound mathematical formulation to prove that such an optimal threshold exists, which can be used to launch MIA. Then, we characterize the conditions under which a unique Markov perfect equilibrium (or steady state) exists in this dynamic system. By means of a comprehensive set of simulations of the proposed game model, we assess different factors that can impact the capability of an adversary to launch MIA in such repeated interaction settings.
- [76] arXiv:2404.07198 [ pdf , ps , html , other ]
-
Title: Zero-shot Logical Query Reasoning on any Knowledge GraphSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Complex logical query answering (CLQA) in knowledge graphs (KGs) goes beyond simple KG completion and aims at answering compositional queries comprised of multiple projections and logical operations. Existing CLQA methods that learn parameters bound to certain entity or relation vocabularies can only be applied to the graph they are trained on which requires substantial training time before being deployed on a new graph. Here we present UltraQuery, an inductive reasoning model that can zero-shot answer logical queries on any KG. The core idea of UltraQuery is to derive both projections and logical operations as vocabulary-independent functions which generalize to new entities and relations in any KG. With the projection operation initialized from a pre-trained inductive KG reasoning model, UltraQuery can solve CLQA on any KG even if it is only finetuned on a single dataset. Experimenting on 23 datasets, UltraQuery in the zero-shot inference mode shows competitive or better query answering performance than best available baselines and sets a new state of the art on 14 of them.
- [77] arXiv:2404.07227 [ pdf , ps , html , other ]
-
Title: Is Complexity an Illusion?Comments: Definitions shared with arXiv:2302.00843Subjects: Artificial Intelligence (cs.AI)
Abstract: Simplicity is held by many to be the key to general intelligence. Simpler models tend to "generalise", identifying the cause or generator of data with greater sample efficiency. The implications of the correlation between simplicity and generalisation extend far beyond computer science, addressing questions of physics and even biology. Yet simplicity is a property of form, while generalisation is of function. In interactive settings, any correlation between the two depends on interpretation. In theory there could be no correlation and yet in practice, there is. Previous theoretical work showed generalisation to be a consequence of "weak" constraints on implied by function, not form. Experiments demonstrated choosing weak constraints over simple forms yielded a 110-500% improvement in generalisation rate. Here we show that all constraints can take equally simple forms, regardless of weakness. However if forms are spatially extended, then function is represented using a finite subset of forms. If function is represented using a finite subset of forms, then we can force a correlation between simplicity and generalisation by making weak constraints take simple forms. If function determined by a goal directed process (e.g. natural selection), then efficiency demands weak constraints take simple forms. Complexity has no causal influence on generalisation, but appears to due to confounding.
- [78] arXiv:2404.07382 [ pdf , ps , html , other ]
-
Title: Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic ProvingChenyang An , Zhibo Chen , Qihao Ye , Emily First , Letian Peng , Jiayun Zhang , Zihan Wang , Sorin Lerner , Jingbo ShangComments: Submitted to ACL on Feb.15th 2024Subjects: Artificial Intelligence (cs.AI) ; Logic in Computer Science (cs.LO)
Abstract: Recent advances in Automated Theorem Proving have shown the effectiveness of leveraging a (large) language model that generates tactics (i.e. proof steps) to search through proof states. The current model, while trained solely on successful proof paths, faces a discrepancy at the inference stage, as it must sample and try various tactics at each proof state until finding success, unlike its training which does not incorporate learning from failed attempts. Intuitively, a tactic that leads to a failed search path would indicate that similar tactics should receive less attention during the following trials. In this paper, we demonstrate the benefit of training models that additionally learn from failed search paths. Facing the lack of such trial-and-error data in existing open-source theorem-proving datasets, we curate a dataset on intuitionistic propositional logic theorems and formalize it in Lean, such that we can reliably check the correctness of proofs. We compare our model trained on relatively short trial-and-error information (TrialMaster) with models trained only on the correct paths and discover that the former solves more unseen theorems with lower trial searches.
- [79] arXiv:2404.07439 [ pdf , ps , html , other ]
-
Title: Behavior Trees Enable Structured Programming of Language Model AgentsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Language models trained on internet-scale data sets have shown an impressive ability to solve problems in Natural Language Processing and Computer Vision. However, experience is showing that these models are frequently brittle in unexpected ways, and require significant scaffolding to ensure that they operate correctly in the larger systems that comprise "language-model agents." In this paper, we argue that behavior trees provide a unifying framework for combining language models with classical AI and traditional programming. We introduce Dendron, a Python library for programming language model agents using behavior trees. We demonstrate the approach embodied by Dendron in three case studies: building a chat agent, a camera-based infrastructure inspection agent for use on a mobile robot or vehicle, and an agent that has been built to satisfy safety constraints that it did not receive through instruction tuning or RLHF.
- [80] arXiv:2404.07456 [ pdf , ps , html , other ]
-
Title: WESE: Weak Exploration to Strong Exploitation for LLM AgentsXu Huang , Weiwen Liu , Xiaolong Chen , Xingmei Wang , Defu Lian , Yasheng Wang , Ruiming Tang , Enhong ChenSubjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA)
Abstract: Recently, large language models (LLMs) have demonstrated remarkable potential as an intelligent agent. However, existing researches mainly focus on enhancing the agent's reasoning or decision-making abilities through well-designed prompt engineering or task-specific fine-tuning, ignoring the procedure of exploration and exploitation. When addressing complex tasks within open-world interactive environments, these methods exhibit limitations. Firstly, the lack of global information of environments leads to greedy decisions, resulting in sub-optimal solutions. On the other hand, irrelevant information acquired from the environment not only adversely introduces noise, but also incurs additional cost. This paper proposes a novel approach, Weak Exploration to Strong Exploitation (WESE), to enhance LLM agents in solving open-world interactive tasks. Concretely, WESE involves decoupling the exploration and exploitation process, employing a cost-effective weak agent to perform exploration tasks for global knowledge. A knowledge graph-based strategy is then introduced to store the acquired knowledge and extract task-relevant knowledge, enhancing the stronger agent in success rate and efficiency for the exploitation task. Our approach is flexible enough to incorporate diverse tasks, and obtains significant improvements in both success rates and efficiency across four interactive benchmarks.
- [81] arXiv:2404.07511 [ pdf , ps , other ]
-
Title: Generative Probabilistic Planning for Optimizing Supply Chain NetworksSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Supply chain networks in enterprises are typically composed of complex topological graphs involving various types of nodes and edges, accommodating numerous products with considerable demand and supply variability. However, as supply chain networks expand in size and complexity, traditional supply chain planning methods (e.g., those found in heuristic rule-based and operations research-based systems) tend to become locally optimal or lack computational scalability, resulting in substantial imbalances between supply and demand across nodes in the network. This paper introduces a novel Generative AI technique, which we call Generative Probabilistic Planning (GPP). GPP generates dynamic supply action plans that are globally optimized across all network nodes over the time horizon for changing objectives like maximizing profits or service levels, factoring in time-varying probabilistic demand, lead time, and production conditions. GPP leverages attention-based graph neural networks (GNN), offline deep reinforcement learning (Offline RL), and policy simulations to train generative policy models and create optimal plans through probabilistic simulations, effectively accounting for various uncertainties. Our experiments using historical data from a global consumer goods company with complex supply chain networks demonstrate that GPP accomplishes objective-adaptable, probabilistically resilient, and dynamic planning for supply chain networks, leading to significant improvements in performance and profitability for enterprises. Our work plays a pivotal role in shaping the trajectory of AI adoption within the supply chain domain.
- [82] arXiv:2404.07523 [ pdf , ps , html , other ]
-
Title: GNN-based Probabilistic Supply and Inventory Predictions in Supply Chain NetworksSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Successful supply chain optimization must mitigate imbalances between supply and demand over time. While accurate demand prediction is essential for supply planning, it alone does not suffice. The key to successful supply planning for optimal and viable execution lies in maximizing predictability for both demand and supply throughout an execution horizon. Therefore, enhancing the accuracy of supply predictions is imperative to create an attainable supply plan that matches demand without overstocking or understocking. However, in complex supply chain networks with numerous nodes and edges, accurate supply predictions are challenging due to dynamic node interactions, cascading supply delays, resource availability, production and logistic capabilities. Consequently, supply executions often deviate from their initial plans. To address this, we present the Graph-based Supply Prediction (GSP) probabilistic model. Our attention-based graph neural network (GNN) model predicts supplies, inventory, and imbalances using graph-structured historical data, demand forecasting, and original supply plan inputs. The experiments, conducted using historical data from a global consumer goods company's large-scale supply chain, demonstrate that GSP significantly improves supply and inventory prediction accuracy, potentially offering supply plan corrections to optimize executions.
- [83] arXiv:2404.07719 [ pdf , ps , html , other ]
-
Title: Reframing the Mind-Body Picture: Applying Formal Systems to the Relationship of Mind and MatterSubjects: Artificial Intelligence (cs.AI) ; Neurons and Cognition (q-bio.NC)
Abstract: This paper aims to show that a simple framework, utilizing basic formalisms from set theory and category theory, can clarify and inform our theories of the relation between mind and matter.
- [84] arXiv:2404.07732 [ pdf , ps , html , other ]
-
Title: Monte Carlo Tree Search with Boltzmann ExplorationComments: Camera ready version of NeurIPS2023 paperJournal-ref: Advances in Neural Information Processing Systems 36 (2024)Subjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Monte-Carlo Tree Search (MCTS) methods, such as Upper Confidence Bound applied to Trees (UCT), are instrumental to automated planning techniques. However, UCT can be slow to explore an optimal action when it initially appears inferior to other actions. Maximum ENtropy Tree-Search (MENTS) incorporates the maximum entropy principle into an MCTS approach, utilising Boltzmann policies to sample actions, naturally encouraging more exploration. In this paper, we highlight a major limitation of MENTS: optimal actions for the maximum entropy objective do not necessarily correspond to optimal actions for the original objective. We introduce two algorithms, Boltzmann Tree Search (BTS) and Decaying ENtropy Tree-Search (DENTS), that address these limitations and preserve the benefits of Boltzmann policies, such as allowing actions to be sampled faster by using the Alias method. Our empirical analysis shows that our algorithms show consistent high performance across several benchmark domains, including the game of Go.
- [85] arXiv:2404.07917 [ pdf , ps , html , other ]
-
Title: DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering DocumentationSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation. Developed with a focus on real-world engineering challenges, DesignQA uniquely combines multimodal data-including textual design requirements, CAD images, and engineering drawings-derived from the Formula SAE student competition. Different from many existing MLLM benchmarks, DesignQA contains document-grounded visual questions where the input image and input document come from different sources. The benchmark features automatic evaluation metrics and is divided into segments-Rule Comprehension, Rule Compliance, and Rule Extraction-based on tasks that engineers perform when designing according to requirements. We evaluate state-of-the-art models like GPT4 and LLaVA against the benchmark, and our study uncovers the existing gaps in MLLMs' abilities to interpret complex engineering documentation. Key findings suggest that while MLLMs demonstrate potential in navigating technical documents, substantial limitations exist, particularly in accurately extracting and applying detailed requirements to engineering designs. This benchmark sets a foundation for future advancements in AI-supported engineering design processes. DesignQA is publicly available at: this https URL .
- [86] arXiv:2404.07934 [ pdf , ps , html , other ]
-
Title: Goal Recognition via Linear ProgrammingComments: Submitted to JAIR April 2024Subjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC)
Abstract: Goal Recognition is the task by which an observer aims to discern the goals that correspond to plans that comply with the perceived behavior of subject agents given as a sequence of observations. Research on Goal Recognition as Planning encompasses reasoning about the model of a planning task, the observations, and the goals using planning techniques, resulting in very efficient recognition approaches. In this article, we design novel recognition approaches that rely on the Operator-Counting framework, proposing new constraints, and analyze their constraints' properties both theoretically and empirically. The Operator-Counting framework is a technique that efficiently computes heuristic estimates of cost-to-goal using Integer/Linear Programming (IP/LP). In the realm of theory, we prove that the new constraints provide lower bounds on the cost of plans that comply with observations. We also provide an extensive empirical evaluation to assess how the new constraints improve the quality of the solution, and we found that they are especially informed in deciding which goals are unlikely to be part of the solution. Our novel recognition approaches have two pivotal advantages: first, they employ new IP/LP constraints for efficiently recognizing goals; second, we show how the new IP/LP constraints can improve the recognition of goals under both partial and noisy observability.
- [87] arXiv:2404.07960 [ pdf , ps , html , other ]
-
Title: Content Knowledge Identification with Multi-Agent Large Language Models (LLMs)Kaiqi Yang , Yucheng Chu , Taylor Darwin , Ahreum Han , Hang Li , Hongzhi Wen , Yasemin Copur-Gencturk , Jiliang Tang , Hui LiuSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY)
Abstract: Teachers' mathematical content knowledge (CK) is of vital importance and need in teacher professional development (PD) programs. Computer-aided asynchronous PD systems are the most recent proposed PD techniques, which aim to help teachers improve their PD equally with fewer concerns about costs and limitations of time or location. However, current automatic CK identification methods, which serve as one of the core techniques of asynchronous PD systems, face challenges such as diversity of user responses, scarcity of high-quality annotated data, and low interpretability of the predictions. To tackle these challenges, we propose a Multi-Agent LLMs-based framework, LLMAgent-CK, to assess the user responses' coverage of identified CK learning goals without human annotations. By taking advantage of multi-agent LLMs in strong generalization ability and human-like discussions, our proposed LLMAgent-CK presents promising CK identifying performance on a real-world mathematical CK dataset MaCKT. Moreover, our case studies further demonstrate the working of the multi-agent framework.
- [88] arXiv:2404.07972 [ pdf , ps , other ]
-
Title: OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer EnvironmentsTianbao Xie , Danyang Zhang , Jixuan Chen , Xiaochuan Li , Siheng Zhao , Ruisheng Cao , Toh Jing Hua , Zhoujun Cheng , Dongchan Shin , Fangyu Lei , Yitao Liu , Yiheng Xu , Shuyan Zhou , Silvio Savarese , Caiming Xiong , Victor Zhong , Tao YuComments: 51 pages, 21 figuresSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at this https URL .
- [89] arXiv:2404.08020 [ pdf , ps , html , other ]
-
Title: Augmenting Knowledge Graph Hierarchies Using Neural TransformersComments: European Conference on Information Retrieval 2024Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: Knowledge graphs are useful tools to organize, recommend and sort data. Hierarchies in knowledge graphs provide significant benefit in improving understanding and compartmentalization of the data within a knowledge graph. This work leverages large language models to generate and augment hierarchies in an existing knowledge graph. For small (<100,000 node) domain-specific KGs, we find that a combination of few-shot prompting with one-shot generation works well, while larger KG may require cyclical generation. We present techniques for augmenting hierarchies, which led to coverage increase by 98% for intents and 99% for colors in our knowledge graph.
- [90] arXiv:2404.08295 [ pdf , ps , html , other ]
-
Title: Study of Emotion Concept Formation by Integrating Vision, Physiology, and Word Information using Multilayered Multimodal Latent Dirichlet AllocationComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. We would like to thank Professor Takayuki Nagai for useful discussionsSubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO); Symbolic Computation (cs.SC)
Abstract: How are emotions formed? Through extensive debate and the promulgation of diverse theories , the theory of constructed emotion has become prevalent in recent research on emotions. According to this theory, an emotion concept refers to a category formed by interoceptive and exteroceptive information associated with a specific emotion. An emotion concept stores past experiences as knowledge and can predict unobserved information from acquired information. Therefore, in this study, we attempted to model the formation of emotion concepts using a constructionist approach from the perspective of the constructed emotion theory. Particularly, we constructed a model using multilayered multimodal latent Dirichlet allocation , which is a probabilistic generative model. We then trained the model for each subject using vision, physiology, and word information obtained from multiple people who experienced different visual emotion-evoking stimuli. To evaluate the model, we verified whether the formed categories matched human subjectivity and determined whether unobserved information could be predicted via categories. The verification results exceeded chance level, suggesting that emotion concept formation can be explained by the proposed model.
- [91] arXiv:2404.08404 [ pdf , ps , html , other ]
-
Title: Complexity of Probabilistic Reasoning for Neurosymbolic Classification TechniquesComments: 21 pages, 5 figuresSubjects: Artificial Intelligence (cs.AI) ; Computational Complexity (cs.CC); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Abstract: Neurosymbolic artificial intelligence is a growing field of research aiming to combine neural network learning capabilities with the reasoning abilities of symbolic systems. Informed multi-label classification is a sub-field of neurosymbolic AI which studies how to leverage prior knowledge to improve neural classification systems. A well known family of neurosymbolic techniques for informed classification use probabilistic reasoning to integrate this knowledge during learning, inference or both. Therefore, the asymptotic complexity of probabilistic reasoning is of cardinal importance to assess the scalability of such techniques. However, this topic is rarely tackled in the neurosymbolic literature, which can lead to a poor understanding of the limits of probabilistic neurosymbolic techniques. In this paper, we introduce a formalism for informed supervised classification tasks and techniques. We then build upon this formalism to define three abstract neurosymbolic techniques based on probabilistic reasoning. Finally, we show computational complexity results on several representation languages for prior knowledge commonly found in the neurosymbolic literature.
- [92] arXiv:2404.08511 [ pdf , ps , html , other ]
-
Title: Leveraging Multi-AI Agents for Cross-Domain Knowledge DiscoveryShiva Aryal , Tuyen Do , Bisesh Heyojoo , Sandeep Chataut , Bichar Dip Shrestha Gurung , Venkataramana Gadhamshetty , Etienne GnimpiebaSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: In the rapidly evolving field of artificial intelligence, the ability to harness and integrate knowledge across various domains stands as a paramount challenge and opportunity. This study introduces a novel approach to cross-domain knowledge discovery through the deployment of multi-AI agents, each specialized in distinct knowledge domains. These AI agents, designed to function as domain-specific experts, collaborate in a unified framework to synthesize and provide comprehensive insights that transcend the limitations of single-domain expertise. By facilitating seamless interaction among these agents, our platform aims to leverage the unique strengths and perspectives of each, thereby enhancing the process of knowledge discovery and decision-making. We present a comparative analysis of the different multi-agent workflow scenarios evaluating their performance in terms of efficiency, accuracy, and the breadth of knowledge integration. Through a series of experiments involving complex, interdisciplinary queries, our findings demonstrate the superior capability of domain specific multi-AI agent system in identifying and bridging knowledge gaps. This research not only underscores the significance of collaborative AI in driving innovation but also sets the stage for future advancements in AI-driven, cross-disciplinary research and application. Our methods were evaluated on a small pilot data and it showed a trend we expected, if we increase the amount of data we custom train the agents, the trend is expected to be more smooth.
- [93] arXiv:2404.08543 [ pdf , ps , html , other ]
-
Title: Memory Traces: Are Transformers Tulving Machines?Comments: 14 pages, 1 figure and 4 tablesSubjects: Artificial Intelligence (cs.AI)
Abstract: Memory traces--changes in the memory system that result from the perception and encoding of an event--were measured in pioneering studies by Endel Tulving and Michael J. Watkins in 1975. These and further experiments informed the maturation of Tulving's memory model, from the GAPS (General Abstract Processing System} to the SPI (Serial-Parallel Independent) model. Having current top of the line LLMs revisit the original Tulving-Watkins tests may help in assessing whether foundation models completely instantiate or not this class of psychological models.
- [94] arXiv:2404.08706 [ pdf , ps , html , other ]
-
Title: Generating Games via LLMs: An Investigation with Video Game Description LanguageSubjects: Artificial Intelligence (cs.AI)
Abstract: Recently, the emergence of large language models (LLMs) has unlocked new opportunities for procedural content generation. However, recent attempts mainly focus on level generation for specific games with defined game rules such as Super Mario Bros. and Zelda. This paper investigates the game generation via LLMs. Based on video game description language, this paper proposes an LLM-based framework to generate game rules and levels simultaneously. Experiments demonstrate how the framework works with prompts considering different combinations of context. Our findings extend the current applications of LLMs and offer new insights for generating new games in the area of procedural content generation.
- [95] arXiv:2404.08791 [ pdf , ps , html , other ]
-
Title: Handling Reward Misspecification in the Presence of Expectation MismatchSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Detecting and handling misspecified objectives, such as reward functions, has been widely recognized as one of the central challenges within the domain of Artificial Intelligence (AI) safety research. However, even with the recognition of the importance of this problem, we are unaware of any works that attempt to provide a clear definition for what constitutes (a) misspecified objectives and (b) successfully resolving such misspecifications. In this work, we use the theory of mind, i.e., the human user's beliefs about the AI agent, as a basis to develop a formal explanatory framework called Expectation Alignment (EAL) to understand the objective misspecification and its causes. Our \EAL\ framework not only acts as an explanatory framework for existing works but also provides us with concrete insights into the limitations of existing methods to handle reward misspecification and novel solution strategies. We use these insights to propose a new interactive algorithm that uses the specified reward to infer potential user expectations about the system behavior. We show how one can efficiently implement this algorithm by mapping the inference problem into linear programs. We evaluate our method on a set of standard Markov Decision Process (MDP) benchmarks.
- [96] arXiv:2404.08837 [ pdf , ps , html , other ]
-
Title: Vehicle-to-Vehicle Charging: Model, Complexity, and HeuristicsComments: 7 pages, 6 figures, and 3 tables. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Artificial Intelligence (cs.AI)
Abstract: The rapid adoption of Electric Vehicles (EVs) poses challenges for electricity grids to accommodate or mitigate peak demand. Vehicle-to-Vehicle Charging (V2VC) has been recently adopted by popular EVs, posing new opportunities and challenges to the management and operation of EVs. We present a novel V2VC model that allows decision-makers to take V2VC into account when optimizing their EV operations. We show that optimizing V2VC is NP-Complete and find that even small problem instances are computationally challenging. We propose R-V2VC, a heuristic that takes advantage of the resulting totally unimodular constraint matrix to efficiently solve problems of realistic sizes. Our results demonstrate that R-V2VC presents a linear growth in the solution time as the problem size increases, while achieving solutions of optimal or near-optimal quality. R-V2VC can be used for real-world operations and to study what-if scenarios when evaluating the costs and benefits of V2VC.
- [97] arXiv:2404.08850 [ pdf , ps , other ]
-
Title: Assessing Economic Viability: A Comparative Analysis of Total Cost of Ownership for Domain-Adapted Large Language Models versus State-of-the-art Counterparts in Chip Design Coding AssistanceSubjects: Artificial Intelligence (cs.AI) ; Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Abstract: This paper presents a comparative analysis of total cost of ownership (TCO) and performance between domain-adapted large language models (LLM) and state-of-the-art (SoTA) LLMs , with a particular emphasis on tasks related to coding assistance for chip design. We examine the TCO and performance metrics of a domain-adaptive LLM, ChipNeMo, against two leading LLMs, Claude 3 Opus and ChatGPT-4 Turbo, to assess their efficacy in chip design coding generation. Through a detailed evaluation of the accuracy of the model, training methodologies, and operational expenditures, this study aims to provide stakeholders with critical information to select the most economically viable and performance-efficient solutions for their specific needs. Our results underscore the benefits of employing domain-adapted models, such as ChipNeMo, that demonstrate improved performance at significantly reduced costs compared to their general-purpose counterparts. In particular, we reveal the potential of domain-adapted LLMs to decrease TCO by approximately 90%-95%, with the cost advantages becoming increasingly evident as the deployment scale expands. With expansion of deployment, the cost benefits of ChipNeMo become more pronounced, making domain-adaptive LLMs an attractive option for organizations with substantial coding needs supported by LLMs
- [98] arXiv:2404.09304 [ pdf , ps , html , other ]
-
Title: Monte Carlo Search Algorithms Discovering Monte Carlo Tree Search Exploration TermsSubjects: Artificial Intelligence (cs.AI)
Abstract: Monte Carlo Tree Search and Monte Carlo Search have good results for many combinatorial problems. In this paper we propose to use Monte Carlo Search to design mathematical expressions that are used as exploration terms for Monte Carlo Tree Search algorithms. The optimized Monte Carlo Tree Search algorithms are PUCT and SHUSS. We automatically design the PUCT and the SHUSS root exploration terms. For small search budgets of 32 evaluations the discovered root exploration terms make both algorithms competitive with usual PUCT.
- [99] arXiv:2404.09305 [ pdf , ps , html , other ]
-
Title: OWLOOP: Interfaces for Mapping OWL Axioms into OOP HierarchiesComments: This manuscript details the implementation of the OWLOOP API. A simplified (and "citable") presentation of our API has been published in the SoftwareX Elsevier journal with the title "OWLOOP: A modular API to describe OWL axioms in OOP objects hierarchies" ( this https URL ). The OWLOOP API repository is available at this https URLSubjects: Artificial Intelligence (cs.AI) ; Logic in Computer Science (cs.LO); Robotics (cs.RO); Software Engineering (cs.SE)
Abstract: The paper tackles the issue of mapping logic axioms formalised in the Ontology Web Language (OWL) within the Object-Oriented Programming (OOP) paradigm. The issues of mapping OWL axioms hierarchies and OOP objects hierarchies are due to OWL-based reasoning algorithms, which might change an OWL hierarchy at runtime; instead, OOP hierarchies are usually defined as static structures. Although programming paradigms based on reflection allow changing the OOP hierarchies at runtime and mapping OWL axioms dynamically, there are no currently available mechanisms that do not limit the reasoning algorithms. Thus, the factory-based paradigm is typically used since it decouples the OWL and OOP hierarchies. However, the factory inhibits OOP polymorphism and introduces a paradigm shift with respect to widely accepted OOP paradigms. We present the OWLOOP API, which exploits the factory to not limit reasoning algorithms, and it provides novel OOP interfaces concerning the axioms in an ontology. OWLOOP is designed to limit the paradigm shift required for using ontologies while improving, through OOP-like polymorphism, the modularity of software architectures that exploit logic reasoning. The paper details our OWL to OOP mapping mechanism, and it shows the benefits and limitations of OWLOOP through examples concerning a robot in a smart environment.
- [100] arXiv:2404.09468 [ pdf , ps , html , other ]
-
Title: MyGO: Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph CompletionComments: Working in progress; Repo is available at this https URLSubjects: Artificial Intelligence (cs.AI)
Abstract: Multi-modal knowledge graphs (MMKG) store structured world knowledge containing rich multi-modal descriptive information. To overcome their inherent incompleteness, multi-modal knowledge graph completion (MMKGC) aims to discover unobserved knowledge from given MMKGs, leveraging both structural information from the triples and multi-modal information of the entities. Existing MMKGC methods usually extract multi-modal features with pre-trained models and employ a fusion module to integrate multi-modal features with triple prediction. However, this often results in a coarse handling of multi-modal data, overlooking the nuanced, fine-grained semantic details and their interactions. To tackle this shortfall, we introduce a novel framework MyGO to process, fuse, and augment the fine-grained modality information from MMKGs. MyGO tokenizes multi-modal raw data as fine-grained discrete tokens and learns entity representations with a cross-modal entity encoder. To further augment the multi-modal representations, MyGO incorporates fine-grained contrastive learning to highlight the specificity of the entity representations. Experiments on standard MMKGC benchmarks reveal that our method surpasses 20 of the latest models, underlining its superior performance. Code and data are available at this https URL
- [101] arXiv:2404.09554 [ pdf , ps , html , other ]
-
Title: Explainable Generative AI (GenXAI): A Survey, Conceptualization, and Research AgendaSubjects: Artificial Intelligence (cs.AI)
Abstract: Generative AI (GenAI) marked a shift from AI being able to recognize to AI being able to generate solutions for a wide variety of tasks. As the generated solutions and applications become increasingly more complex and multi-faceted, novel needs, objectives, and possibilities have emerged for explainability (XAI). In this work, we elaborate on why XAI has gained importance with the rise of GenAI and its challenges for explainability research. We also unveil novel and emerging desiderata that explanations should fulfill, covering aspects such as verifiability, interactivity, security, and cost. To this end, we focus on surveying existing works. Furthermore, we provide a taxonomy of relevant dimensions that allows us to better characterize existing XAI mechanisms and methods for GenAI. We discuss different avenues to ensure XAI, from training data to prompting. Our paper offers a short but concise technical background of GenAI for non-technical readers, focusing on text and images to better understand novel or adapted XAI techniques for GenAI. However, due to the vast array of works on GenAI, we decided to forego detailed aspects of XAI related to evaluation and usage of explanations. As such, the manuscript interests both technically oriented people and other disciplines, such as social scientists and information systems researchers. Our research roadmap provides more than ten directions for future investigation.
- [102] arXiv:2404.09587 [ pdf , ps , html , other ]
-
Title: German Tourism Knowledge GraphComments: 4 pages. Accepted to Poster and Demo Track of 21st European Semantic Web Conference 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: Tourism is one of the most critical sectors of the global economy. Due to its heterogeneous and fragmented nature, it provides one of the most suitable use cases for knowledge graphs. In this poster, we introduce the German Tourism Knowledge Graph that integrates tourism-related data from 16 federal states of Germany and various other sources to provide a curated knowledge source for various applications. It is publicly available through GUIs and an API.
- [103] arXiv:2404.09631 [ pdf , ps , html , other ]
-
Title: Action Model Learning with GuaranteesSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper studies the problem of action model learning with full observability. Following the learning by search paradigm by Mitchell, we develop a theory for action model learning based on version spaces that interprets the task as search for hypothesis that are consistent with the learning examples. Our theoretical findings are instantiated in an online algorithm that maintains a compact representation of all solutions of the problem. Among these range of solutions, we bring attention to actions models approximating the actual transition system from below (sound models) and from above (complete models). We show how to manipulate the output of our learning algorithm to build deterministic and non-deterministic formulations of the sound and complete models and prove that, given enough examples, both formulations converge into the very same true model. Our experiments reveal their usefulness over a range of planning domains.
- [104] arXiv:2404.09848 [ pdf , ps , html , other ]
-
Title: HyperMono: A Monotonicity-aware Approach to Hyper-Relational Knowledge RepresentationSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: In a hyper-relational knowledge graph (HKG), each fact is composed of a main triple associated with attribute-value qualifiers, which express additional factual knowledge. The hyper-relational knowledge graph completion (HKGC) task aims at inferring plausible missing links in a HKG. Most existing approaches to HKGC focus on enhancing the communication between qualifier pairs and main triples, while overlooking two important properties that emerge from the monotonicity of the hyper-relational graphs representation regime. Stage Reasoning allows for a two-step reasoning process, facilitating the integration of coarse-grained inference results derived solely from main triples and fine-grained inference results obtained from hyper-relational facts with qualifiers. In the initial stage, coarse-grained results provide an upper bound for correct predictions, which are subsequently refined in the fine-grained step. More generally, Qualifier Monotonicity implies that by attaching more qualifier pairs to a main triple, we may only narrow down the answer set, but never enlarge it. This paper proposes the HyperMono model for hyper-relational knowledge graph completion, which realizes stage reasoning and qualifier monotonicity. To implement qualifier monotonicity HyperMono resorts to cone embeddings. Experiments on three real-world datasets with three different scenario conditions demonstrate the strong performance of HyperMono when compared to the SoTA.
- [105] arXiv:2404.09877 [ pdf , ps , html , other ]
-
Title: Synergising Human-like Responses and Machine Intelligence for Planning in Disaster ResponseComments: 2024 IEEE World Congress on Computational Intelligence (IEEE WCCI), 2024 International Joint Conference on Neural Networks (IJCNN)Subjects: Artificial Intelligence (cs.AI)
Abstract: In the rapidly changing environments of disaster response, planning and decision-making for autonomous agents involve complex and interdependent choices. Although recent advancements have improved traditional artificial intelligence (AI) approaches, they often struggle in such settings, particularly when applied to agents operating outside their well-defined training parameters. To address these challenges, we propose an attention-based cognitive architecture inspired by Dual Process Theory (DPT). This framework integrates, in an online fashion, rapid yet heuristic (human-like) responses (System 1) with the slow but optimized planning capabilities of machine intelligence (System 2). We illustrate how a supervisory controller can dynamically determine in real-time the engagement of either system to optimize mission objectives by assessing their performance across a number of distinct attributes. Evaluated for trajectory planning in dynamic environments, our framework demonstrates that this synergistic integration effectively manages complex tasks by optimizing multiple mission objectives.
- [106] arXiv:2404.09897 [ pdf , ps , html , other ]
-
Title: Progressive Knowledge Graph CompletionComments: 14 pages, 10 figuresSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Knowledge Graph Completion (KGC) has emerged as a promising solution to address the issue of incompleteness within Knowledge Graphs (KGs). Traditional KGC research primarily centers on triple classification and link prediction. Nevertheless, we contend that these tasks do not align well with real-world scenarios and merely serve as surrogate benchmarks. In this paper, we investigate three crucial processes relevant to real-world construction scenarios: (a) the verification process, which arises from the necessity and limitations of human verifiers; (b) the mining process, which identifies the most promising candidates for verification; and (c) the training process, which harnesses verified data for subsequent utilization; in order to achieve a transition toward more realistic challenges. By integrating these three processes, we introduce the Progressive Knowledge Graph Completion (PKGC) task, which simulates the gradual completion of KGs in real-world scenarios. Furthermore, to expedite PKGC processing, we propose two acceleration modules: Optimized Top-$k$ algorithm and Semantic Validity Filter. These modules significantly enhance the efficiency of the mining procedure. Our experiments demonstrate that performance in link prediction does not accurately reflect performance in PKGC. A more in-depth analysis reveals the key factors influencing the results and provides potential directions for future research.
- [107] arXiv:2404.09939 [ pdf , ps , html , other ]
-
Title: A Survey on Deep Learning for Theorem ProvingSubjects: Artificial Intelligence (cs.AI)
Abstract: Theorem proving is a fundamental aspect of mathematics, spanning from informal reasoning in mathematical language to rigorous derivations in formal systems. In recent years, the advancement of deep learning, especially the emergence of large language models, has sparked a notable surge of research exploring these techniques to enhance the process of theorem proving. This paper presents a pioneering comprehensive survey of deep learning for theorem proving by offering i) a thorough review of existing approaches across various tasks such as autoformalization, premise selection, proofstep generation, and proof search; ii) a meticulous summary of available datasets and strategies for data generation; iii) a detailed analysis of evaluation metrics and the performance of state-of-the-art; and iv) a critical discussion on the persistent challenges and the promising avenues for future exploration. Our survey aims to serve as a foundational reference for deep learning approaches in theorem proving, seeking to catalyze further research endeavors in this rapidly growing field.
- [108] arXiv:2404.10024 [ pdf , ps , html , other ]
-
Title: ClimODE: Climate and Weather Forecasting with Physics-informed Neural ODEsComments: Accepted as ICLR 2024 Oral. Project website: this https URLSubjects: Artificial Intelligence (cs.AI) ; Emerging Technologies (cs.ET); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Abstract: Climate and weather prediction traditionally relies on complex numerical simulations of atmospheric physics. Deep learning approaches, such as transformers, have recently challenged the simulation paradigm with complex network forecasts. However, they often act as data-driven black-box models that neglect the underlying physics and lack uncertainty quantification. We address these limitations with ClimODE, a spatiotemporal continuous-time process that implements a key principle of advection from statistical mechanics, namely, weather changes due to a spatial movement of quantities over time. ClimODE models precise weather evolution with value-conserving dynamics, learning global weather transport as a neural flow, which also enables estimating the uncertainty in predictions. Our approach outperforms existing data-driven methods in global and regional forecasting with an order of magnitude smaller parameterization, establishing a new state of the art.
- [109] arXiv:2404.10097 [ pdf , ps , other ]
-
Title: LegalPro-BERT: Classification of Legal Provisions by fine-tuning BERT Large Language ModelComments: 17 pages, 4 figuresSubjects: Artificial Intelligence (cs.AI) ; Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: A contract is a type of legal document commonly used in organizations. Contract review is an integral and repetitive process to avoid business risk and liability. Contract analysis requires the identification and classification of key provisions and paragraphs within an agreement. Identification and validation of contract clauses can be a time-consuming and challenging task demanding the services of trained and expensive lawyers, paralegals or other legal assistants. Classification of legal provisions in contracts using artificial intelligence and natural language processing is complex due to the requirement of domain-specialized legal language for model training and the scarcity of sufficient labeled data in the legal domain. Using general-purpose models is not effective in this context due to the use of specialized legal vocabulary in contracts which may not be recognized by a general model. To address this problem, we propose the use of a pre-trained large language model which is subsequently calibrated on legal taxonomy. We propose LegalPro-BERT, a BERT transformer architecture model that we fine-tune to efficiently handle classification task for legal provisions. We conducted experiments to measure and compare metrics with current benchmark results. We found that LegalPro-BERT outperforms the previous benchmark used for comparison in this research.
- [110] arXiv:2404.10102 [ pdf , ps , html , other ]
-
Title: Chinchilla Scaling: A replication attemptSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Hoffmann et al. (2022) propose three methods for estimating a compute-optimal scaling law. We attempt to replicate their third estimation procedure, which involves fitting a parametric loss function to a reconstruction of data from their plots. We find that the reported estimates are inconsistent with their first two estimation methods, fail at fitting the extracted data, and report implausibly narrow confidence intervals--intervals this narrow would require over 600,000 experiments, while they likely only ran fewer than 500. In contrast, our rederivation of the scaling law using the third approach yields results that are compatible with the findings from the first two estimation procedures described by Hoffmann et al.
- [111] arXiv:2404.10160 [ pdf , ps , html , other ]
-
Title: RLRF:Reinforcement Learning from Reflection through Debates as Feedback for Bias Mitigation in LLMsSubjects: Artificial Intelligence (cs.AI)
Abstract: Biases and stereotypes in Large Language Models (LLMs) can have negative implications for user experience and societal outcomes. Current approaches to bias mitigation like Reinforcement Learning from Human Feedback (RLHF) rely on costly manual feedback. While LLMs have the capability to understand logic and identify biases in text, they often struggle to effectively acknowledge and address their own biases due to factors such as prompt influences, internal mechanisms, and policies. We found that informing LLMs that the content they generate is not their own and questioning them about potential biases in the text can significantly enhance their recognition and improvement capabilities regarding biases. Based on this finding, we propose RLRF (Reinforcement Learning from Reflection through Debates as Feedback), replacing human feedback with AI for bias mitigation. RLRF engages LLMs in multi-role debates to expose biases and gradually reduce biases in each iteration using a ranking scoring mechanism. The dialogue are then used to create a dataset with high-bias and low-bias instances to train the reward model in reinforcement learning. This dataset can be generated by the same LLMs for self-reflection or a superior LLMs guiding the former in a student-teacher mode to enhance its logical reasoning abilities. Experimental results demonstrate the significant effectiveness of our approach in bias reduction.
- [112] arXiv:2404.10200 [ pdf , ps , html , other ]
-
Title: TEL'M: Test and Evaluation of Language ModelsSubjects: Artificial Intelligence (cs.AI)
Abstract: Language Models have demonstrated remarkable capabilities on some tasks while failing dramatically on others. The situation has generated considerable interest in understanding and comparing the capabilities of various Language Models (LMs) but those efforts have been largely ad hoc with results that are often little more than anecdotal. This is in stark contrast with testing and evaluation processes used in healthcare, radar signal processing, and other defense areas. In this paper, we describe Test and Evaluation of Language Models (TEL'M) as a principled approach for assessing the value of current and future LMs focused on high-value commercial, government and national security applications. We believe that this methodology could be applied to other Artificial Intelligence (AI) technologies as part of the larger goal of "industrializing" AI.
- [113] arXiv:2404.10209 [ pdf , ps , html , other ]
-
Title: Demonstration of DB-GPT: Next Generation Data Interaction System Empowered by Large Language ModelsSiqiao Xue , Danrui Qi , Caigao Jiang , Wenhui Shi , Fangyin Cheng , Keting Chen , Hongjun Yang , Zhiping Zhang , Jianshan He , Hongyang Zhang , Ganglin Wei , Wang Zhao , Fan Zhou , Hong Yi , Shaodong Liu , Hongjun Yang , Faqiang ChenSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: The recent breakthroughs in large language models (LLMs) are positioned to transition many areas of software. The technologies of interacting with data particularly have an important entanglement with LLMs as efficient and intuitive data interactions are paramount. In this paper, we present DB-GPT, a revolutionary and product-ready Python library that integrates LLMs into traditional data interaction tasks to enhance user experience and accessibility. DB-GPT is designed to understand data interaction tasks described by natural language and provide context-aware responses powered by LLMs, making it an indispensable tool for users ranging from novice to expert. Its system design supports deployment across local, distributed, and cloud environments. Beyond handling basic data interaction tasks like Text-to-SQL with LLMs, it can handle complex tasks like generative data analysis through a Multi-Agents framework and the Agentic Workflow Expression Language (AWEL). The Service-oriented Multi-model Management Framework (SMMF) ensures data privacy and security, enabling users to employ DB-GPT with private LLMs. Additionally, DB-GPT offers a series of product-ready features designed to enable users to integrate DB-GPT within their product environments easily. The code of DB-GPT is available at Github( this https URL ) which already has over 10.7k stars. Please install DB-GPT for your own usage with the instructions( this https URL ) and watch a 5-minute introduction video on Youtube( this https URL ) to further investigate DB-GPT.
- [114] arXiv:2404.10226 [ pdf , ps , html , other ]
-
Title: Find The Gap: Knowledge Base Reasoning For Visual Question AnsweringSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: We analyze knowledge-based visual question answering, for which given a question, the models need to ground it into the visual modality and retrieve the relevant knowledge from a given large knowledge base (KB) to be able to answer. Our analysis has two folds, one based on designing neural architectures and training them from scratch, and another based on large pre-trained language models (LLMs). Our research questions are: 1) Can we effectively augment models by explicit supervised retrieval of the relevant KB information to solve the KB-VQA problem? 2) How do task-specific and LLM-based models perform in the integration of visual and external knowledge, and multi-hop reasoning over both sources of information? 3) Is the implicit knowledge of LLMs sufficient for KB-VQA and to what extent it can replace the explicit KB? Our results demonstrate the positive impact of empowering task-specific and LLM models with supervised external and visual knowledge retrieval models. Our findings show that though LLMs are stronger in 1-hop reasoning, they suffer in 2-hop reasoning in comparison with our fine-tuned NN model even if the relevant information from both modalities is available to the model. Moreover, we observed that LLM models outperform the NN model for KB-related questions which confirms the effectiveness of implicit knowledge in LLMs however, they do not alleviate the need for external KB.
- [115] arXiv:2404.10234 [ pdf , ps , html , other ]
-
Title: Compressible and Searchable: AI-native Multi-Modal Retrieval System with Learned Image CompressionSubjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Abstract: The burgeoning volume of digital content across diverse modalities necessitates efficient storage and retrieval methods. Conventional approaches struggle to cope with the escalating complexity and scale of multimedia data. In this paper, we proposed framework addresses this challenge by fusing AI-native multi-modal search capabilities with neural image compression. First we analyze the intricate relationship between compressibility and searchability, recognizing the pivotal role each plays in the efficiency of storage and retrieval systems. Through the usage of simple adapter is to bridge the feature of Learned Image Compression(LIC) and Contrastive Language-Image Pretraining(CLIP) while retaining semantic fidelity and retrieval of multi-modal data. Experimental evaluations on Kodak datasets demonstrate the efficacy of our approach, showcasing significant enhancements in compression efficiency and search accuracy compared to existing methodologies. Our work marks a significant advancement towards scalable and efficient multi-modal search systems in the era of big data.
- [116] arXiv:2404.10274 [ pdf , ps , other ]
-
Title: Sparse Attention Regression Network Based Soil Fertility Prediction With UmmasoSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: The challenge of imbalanced soil nutrient datasets significantly hampers accurate predictions of soil fertility. To tackle this, a new method is suggested in this research, combining Uniform Manifold Approximation and Projection (UMAP) with Least Absolute Shrinkage and Selection Operator (LASSO). The main aim is to counter the impact of uneven data distribution and improve soil fertility models' predictive precision. The model introduced uses Sparse Attention Regression, effectively incorporating pertinent features from the imbalanced dataset. UMAP is utilized initially to reduce data complexity, unveiling hidden structures and important patterns. Following this, LASSO is applied to refine features and enhance the model's interpretability. The experimental outcomes highlight the effectiveness of the UMAP and LASSO hybrid approach. The proposed model achieves outstanding performance metrics, reaching a predictive accuracy of 98%, demonstrating its capability in accurate soil fertility predictions. Additionally, it showcases a Precision of 91.25%, indicating its adeptness in identifying fertile soil instances accurately. The Recall metric stands at 90.90%, emphasizing the model's ability to capture true positive cases effectively.
- [117] arXiv:2404.10317 [ pdf , ps , html , other ]
-
Title: LLMs4OM: Matching Ontologies with Large Language ModelsComments: 8 pages, 1 figure, accepted to ESWC 2024 Special Track on LLMs for Knowledge Engineering ( this https URL )Subjects: Artificial Intelligence (cs.AI)
Abstract: Ontology Matching (OM), is a critical task in knowledge integration, where aligning heterogeneous ontologies facilitates data interoperability and knowledge sharing. Traditional OM systems often rely on expert knowledge or predictive models, with limited exploration of the potential of Large Language Models (LLMs). We present the LLMs4OM framework, a novel approach to evaluate the effectiveness of LLMs in OM tasks. This framework utilizes two modules for retrieval and matching, respectively, enhanced by zero-shot prompting across three ontology representations: concept, concept-parent, and concept-children. Through comprehensive evaluations using 20 OM datasets from various domains, we demonstrate that LLMs, under the LLMs4OM framework, can match and even surpass the performance of traditional OM systems, particularly in complex matching scenarios. Our results highlight the potential of LLMs to significantly contribute to the field of OM.
- [118] arXiv:2404.10329 [ pdf , ps , html , other ]
-
Title: Towards Complex Ontology Alignment using Large Language ModelsSubjects: Artificial Intelligence (cs.AI)
Abstract: Ontology alignment, a critical process in the Semantic Web for detecting relationships between different ontologies, has traditionally focused on identifying so-called "simple" 1-to-1 relationships through class labels and properties comparison. The more practically useful exploration of more complex alignments remains a hard problem to automate, and as such is largely underexplored, i.e. in application practice it is usually done manually by ontology and domain experts. Recently, the surge in Natural Language Processing (NLP) capabilities, driven by advancements in Large Language Models (LLMs), presents new opportunities for enhancing ontology engineering practices, including ontology alignment tasks. This paper investigates the application of LLM technologies to tackle the complex ontology alignment challenge. Leveraging a prompt-based approach and integrating rich ontology content so-called modules our work constitutes a significant advance towards automating the complex alignment task.
- [119] arXiv:2404.10337 [ pdf , ps , html , other ]
-
Title: Intriguing Properties of Positional Encoding in Time Series ForecastingSubjects: Artificial Intelligence (cs.AI)
Abstract: Transformer-based methods have made significant progress in time series forecasting (TSF). They primarily handle two types of tokens, i.e., temporal tokens that contain all variables of the same timestamp, and variable tokens that contain all input time points for a specific variable. Transformer-based methods rely on positional encoding (PE) to mark tokens' positions, facilitating the model to perceive the correlation between tokens. However, in TSF, research on PE remains insufficient. To address this gap, we conduct experiments and uncover intriguing properties of existing PEs in TSF: (i) The positional information injected by PEs diminishes as the network depth increases; (ii) Enhancing positional information in deep networks is advantageous for improving the model's performance; (iii) PE based on the similarity between tokens can improve the model's performance. Motivated by these findings, we introduce two new PEs: Temporal Position Encoding (T-PE) for temporal tokens and Variable Positional Encoding (V-PE) for variable tokens. Both T-PE and V-PE incorporate geometric PE based on tokens' positions and semantic PE based on the similarity between tokens but using different calculations. To leverage both the PEs, we design a Transformer-based dual-branch framework named T2B-PE. It first calculates temporal tokens' correlation and variable tokens' correlation respectively and then fuses the dual-branch features through the gated unit. Extensive experiments demonstrate the superior robustness and effectiveness of T2B-PE. The code is available at: \href{ this https URL }{ this https URL }.
- [120] arXiv:2404.10387 [ pdf , ps , html , other ]
-
Title: CNN-based explanation ensembling for dataset, representation and explanations evaluationComments: accepted at 2nd World Conference on eXplainable Artificial IntelligenceSubjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV)
Abstract: Explainable Artificial Intelligence has gained significant attention due to the widespread use of complex deep learning models in high-stake domains such as medicine, finance, and autonomous cars. However, different explanations often present different aspects of the model's behavior. In this research manuscript, we explore the potential of ensembling explanations generated by deep classification models using convolutional model. Through experimentation and analysis, we aim to investigate the implications of combining explanations to uncover a more coherent and reliable patterns of the model's behavior, leading to the possibility of evaluating the representation learned by the model. With our method, we can uncover problems of under-representation of images in a certain class. Moreover, we discuss other side benefits like features' reduction by replacing the original image with its explanations resulting in the removal of some sensitive information. Through the use of carefully selected evaluation metrics from the Quantus library, we demonstrated the method's superior performance in terms of Localisation and Faithfulness, compared to individual explanations.
- [121] arXiv:2404.10416 [ pdf , ps , html , other ]
-
Title: Disentangling Instructive Information from Ranked Multiple Candidates for Multi-Document Scientific SummarizationComments: Accepted by SIGIR 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: Automatically condensing multiple topic-related scientific papers into a succinct and concise summary is referred to as Multi-Document Scientific Summarization (MDSS). Currently, while commonly used abstractive MDSS methods can generate flexible and coherent summaries, the difficulty in handling global information and the lack of guidance during decoding still make it challenging to generate better summaries. To alleviate these two shortcomings, this paper introduces summary candidates into MDSS, utilizing the global information of the document set and additional guidance from the summary candidates to guide the decoding process. Our insights are twofold: Firstly, summary candidates can provide instructive information from both positive and negative perspectives, and secondly, selecting higher-quality candidates from multiple options contributes to producing better summaries. Drawing on the insights, we propose a summary candidates fusion framework -- Disentangling Instructive information from Ranked candidates (DIR) for MDSS. Specifically, DIR first uses a specialized pairwise comparison method towards multiple candidates to pick out those of higher quality. Then DIR disentangles the instructive information of summary candidates into positive and negative latent variables with Conditional Variational Autoencoder. These variables are further incorporated into the decoder to guide generation. We evaluate our approach with three different types of Transformer-based models and three different types of candidates, and consistently observe noticeable performance improvements according to automatic and human evaluation. More analyses further demonstrate the effectiveness of our model in handling global information and enhancing decoding controllability.
- [122] arXiv:2404.10429 [ pdf , ps , html , other ]
-
Title: MEEL: Multi-Modal Event Evolution LearningZhengwei Tao , Zhi Jin , Junqiang Huang , Xiancai Chen , Xiaoying Bai , Haiyan Zhao , Yifan Zhang , Chongyang TaoSubjects: Artificial Intelligence (cs.AI)
Abstract: Multi-modal Event Reasoning (MMER) endeavors to endow machines with the ability to comprehend intricate event relations across diverse data modalities. MMER is fundamental and underlies a wide broad of applications. Despite extensive instruction fine-tuning, current multi-modal large language models still fall short in such ability. The disparity stems from that existing models are insufficient to capture underlying principles governing event evolution in various scenarios. In this paper, we introduce Multi-Modal Event Evolution Learning (MEEL) to enable the model to grasp the event evolution mechanism, yielding advanced MMER ability. Specifically, we commence with the design of event diversification to gather seed events from a rich spectrum of scenarios. Subsequently, we employ ChatGPT to generate evolving graphs for these seed events. We propose an instruction encapsulation process that formulates the evolving graphs into instruction-tuning data, aligning the comprehension of event reasoning to humans. Finally, we observe that models trained in this way are still struggling to fully comprehend event evolution. In such a case, we propose the guiding discrimination strategy, in which models are trained to discriminate the improper evolution direction. We collect and curate a benchmark M-EV2 for MMER. Extensive experiments on M-EV2 validate the effectiveness of our approach, showcasing competitive performance in open-source multi-modal LLMs.
- [123] arXiv:2404.10498 [ pdf , ps , html , other ]
-
Title: LAECIPS: Large Vision Model Assisted Adaptive Edge-Cloud Collaboration for IoT-based Perception SystemSubjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Recent large vision models (e.g., SAM) enjoy great potential to facilitate intelligent perception with high accuracy. Yet, the resource constraints in the IoT environment tend to limit such large vision models to be locally deployed, incurring considerable inference latency thereby making it difficult to support real-time applications, such as autonomous driving and robotics. Edge-cloud collaboration with large-small model co-inference offers a promising approach to achieving high inference accuracy and low latency. However, existing edge-cloud collaboration methods are tightly coupled with the model architecture and cannot adapt to the dynamic data drifts in heterogeneous IoT environments. To address the issues, we propose LAECIPS, a new edge-cloud collaboration framework. In LAECIPS, both the large vision model on the cloud and the lightweight model on the edge are plug-and-play. We design an edge-cloud collaboration strategy based on hard input mining, optimized for both high accuracy and low latency. We propose to update the edge model and its collaboration strategy with the cloud under the supervision of the large vision model, so as to adapt to the dynamic IoT data streams. Theoretical analysis of LAECIPS proves its feasibility. Experiments conducted in a robotic semantic segmentation system using real-world datasets show that LAECIPS outperforms its state-of-the-art competitors in accuracy, latency, and communication overhead while having better adaptability to dynamic environments.
- [124] arXiv:2404.10505 [ pdf , ps , html , other ]
-
Title: Data Collection of Real-Life Knowledge Work in Context: The RLKWiC DatasetComments: Accepted and presented at the 10th International Conference on Information Management (ICIM2024), will be published in Springer CCIS series Conference Proceedings (Electronic ISSN: 1865-0937; Print ISSN: 1865-0929)Subjects: Artificial Intelligence (cs.AI)
Abstract: Over the years, various approaches have been employed to enhance the productivity of knowledge workers, from addressing psychological well-being to the development of personal knowledge assistants. A significant challenge in this research area has been the absence of a comprehensive, publicly accessible dataset that mirrors real-world knowledge work. Although a handful of datasets exist, many are restricted in access or lack vital information dimensions, complicating meaningful comparison and benchmarking in the domain. This paper presents RLKWiC, a novel dataset of Real-Life Knowledge Work in Context, derived from monitoring the computer interactions of eight participants over a span of two months. As the first publicly available dataset offering a wealth of essential information dimensions (such as explicated contexts, textual contents, and semantics), RLKWiC seeks to address the research gap in the personal information management domain, providing valuable insights for modeling user behavior.
- [125] arXiv:2404.10551 [ pdf , ps , html , other ]
-
Title: The Evolution of Learning: Assessing the Transformative Impact of Generative AI on Higher EducationSubjects: Artificial Intelligence (cs.AI) ; Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Abstract: Generative Artificial Intelligence (GAI) models such as ChatGPT have experienced a surge in popularity, attracting 100 million active users in 2 months and generating an estimated 10 million daily queries. Despite this remarkable adoption, there remains a limited understanding to which extent this innovative technology influences higher education. This research paper investigates the impact of GAI on university students and Higher Education Institutions (HEIs). The study adopts a mixed-methods approach, combining a comprehensive survey with scenario analysis to explore potential benefits, drawbacks, and transformative changes the new technology brings. Using an online survey with 130 participants we assessed students' perspectives and attitudes concerning present ChatGPT usage in academics. Results show that students use the current technology for tasks like assignment writing and exam preparation and believe it to be a effective help in achieving academic goals. The scenario analysis afterwards projected potential future scenarios, providing valuable insights into the possibilities and challenges associated with incorporating GAI into higher education. The main motivation is to gain a tangible and precise understanding of the potential consequences for HEIs and to provide guidance responding to the evolving learning environment. The findings indicate that irresponsible and excessive use of the technology could result in significant challenges. Hence, HEIs must develop stringent policies, reevaluate learning objectives, upskill their lecturers, adjust the curriculum and reconsider examination approaches.
- [126] arXiv:2404.10573 [ pdf , ps , html , other ]
-
Title: AAVDiff: Experimental Validation of Enhanced Viability and Diversity in Recombinant Adeno-Associated Virus (AAV) Capsids through Diffusion GenerationLijun Liu , Jiali Yang , Jianfei Song , Xinglin Yang , Lele Niu , Zeqi Cai , Hui Shi , Tingjun Hou , Chang-yu Hsieh , Weiran Shen , Yafeng DengSubjects: Artificial Intelligence (cs.AI) ; Computational Engineering, Finance, and Science (cs.CE); Biomolecules (q-bio.BM)
Abstract: Recombinant adeno-associated virus (rAAV) vectors have revolutionized gene therapy, but their broad tropism and suboptimal transduction efficiency limit their clinical applications. To overcome these limitations, researchers have focused on designing and screening capsid libraries to identify improved vectors. However, the large sequence space and limited resources present challenges in identifying viable capsid variants. In this study, we propose an end-to-end diffusion model to generate capsid sequences with enhanced viability. Using publicly available AAV2 data, we generated 38,000 diverse AAV2 viral protein (VP) sequences, and evaluated 8,000 for viral selection. The results attested the superiority of our model compared to traditional methods. Additionally, in the absence of AAV9 capsid data, apart from one wild-type sequence, we used the same model to directly generate a number of viable sequences with up to 9 mutations. we transferred the remaining 30,000 samples to the AAV9 domain. Furthermore, we conducted mutagenesis on AAV9 VP hypervariable regions VI and V, contributing to the continuous improvement of the AAV9 VP sequence. This research represents a significant advancement in the design and functional validation of rAAV vectors, offering innovative solutions to enhance specificity and transduction efficiency in gene therapy applications.
- [127] arXiv:2404.10579 [ pdf , ps , other ]
-
Title: The application of Augmented Reality (AR) in Remote Work and EducationSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: With the rapid advancement of technology, Augmented Reality (AR) technology, known for its ability to deeply integrate virtual information with the real world, is gradually transforming traditional work modes and teaching methods. Particularly in the realms of remote work and online education, AR technology demonstrates a broad spectrum of application prospects. This paper delves into the application potential and actual effects of AR technology in remote work and education. Through a systematic literature review, this study outlines the key features, advantages, and challenges of AR technology. Based on theoretical analysis, it discusses the scientific basis and technical support that AR technology provides for enhancing remote work efficiency and promoting innovation in educational teaching models. Additionally, by designing an empirical research plan and analyzing experimental data, this article reveals the specific performance and influencing factors of AR technology in practical applications. Finally, based on the results of the experiments, this research summarizes the application value of AR technology in remote work and education, looks forward to its future development trends, and proposes forward-looking research directions and strategic suggestions, offering empirical foundation and theoretical guidance for further promoting the in-depth application of AR technology in related fields.
- [128] arXiv:2404.10618 [ pdf , ps , html , other ]
-
Title: Private Attribute Inference from Images with Vision-Language ModelsSubjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: As large language models (LLMs) become ubiquitous in our daily tasks and digital interactions, associated privacy risks are increasingly in focus. While LLM privacy research has primarily focused on the leakage of model training data, it has recently been shown that the increase in models' capabilities has enabled LLMs to make accurate privacy-infringing inferences from previously unseen texts. With the rise of multimodal vision-language models (VLMs), capable of understanding both images and text, a pertinent question is whether such results transfer to the previously unexplored domain of benign images posted online. To investigate the risks associated with the image reasoning capabilities of newly emerging VLMs, we compile an image dataset with human-annotated labels of the image owner's personal attributes. In order to understand the additional privacy risk posed by VLMs beyond traditional human attribute recognition, our dataset consists of images where the inferable private attributes do not stem from direct depictions of humans. On this dataset, we evaluate the inferential capabilities of 7 state-of-the-art VLMs, finding that they can infer various personal attributes at up to 77.6% accuracy. Concerningly, we observe that accuracy scales with the general capabilities of the models, implying that future models can be misused as stronger adversaries, establishing an imperative for the development of adequate defenses.
- [129] arXiv:2404.10646 [ pdf , ps , html , other ]
-
Title: Efficient Parking Search using Shared Fleet DataComments: Long Version; published at 2021 22nd IEEE International Conference on Mobile Data Management (MDM)Journal-ref: 2021 22nd IEEE International Conference on Mobile Data Management (MDM)Subjects: Artificial Intelligence (cs.AI)
Abstract: Finding an available on-street parking spot is a relevant problem of day-to-day life. In recent years, cities such as Melbourne and San Francisco deployed sensors that provide real-time information about the occupation of parking spots. Finding a free parking spot in such a smart environment can be modeled and solved as a Markov decision process (MDP). The problem has to consider uncertainty as available parking spots might not remain available until arrival due to other vehicles also claiming spots in the meantime. Knowing the parking intention of every vehicle in the environment would eliminate this uncertainty. Unfortunately, it does currently not seem realistic to have such data from all vehicles. In contrast, acquiring data from a subset of vehicles or a vehicle fleet appears feasible and has the potential to reduce uncertainty.
In this paper, we examine the question of how useful sharing data within a vehicle fleet might be for the search times of particular drivers. We use fleet data to better estimate the availability of parking spots at arrival. Since optimal solutions for large scenarios are infeasible, we base our method on approximate solutions, which have been shown to perform well in single-agent settings. Our experiments are conducted on a simulation using real-world and synthetic data from the city of Melbourne. The results indicate that fleet data can significantly reduce search times for an available parking spot. - [130] arXiv:2404.10683 [ pdf , ps , html , other ]
-
Title: Simplex Decomposition for Portfolio Allocation Constraints in Reinforcement LearningJournal-ref: ECAI 2023 - 26th European Conference on Artificial Intelligence, September 30 - October 4, 2023, Krakow, PolandSubjects: Artificial Intelligence (cs.AI)
Abstract: Portfolio optimization tasks describe sequential decision problems in which the investor's wealth is distributed across a set of assets. Allocation constraints are used to enforce minimal or maximal investments into particular subsets of assets to control for objectives such as limiting the portfolio's exposure to a certain sector due to environmental concerns. Although methods for constrained Reinforcement Learning (CRL) can optimize policies while considering allocation constraints, it can be observed that these general methods yield suboptimal results. In this paper, we propose a novel approach to handle allocation constraints based on a decomposition of the constraint action space into a set of unconstrained allocation problems. In particular, we examine this approach for the case of two constraints. For example, an investor may wish to invest at least a certain percentage of the portfolio into green technologies while limiting the investment in the fossil energy sector. We show that the action space of the task is equivalent to the decomposed action space, and introduce a new reinforcement learning (RL) approach CAOSD, which is built on top of the decomposition. The experimental evaluation on real-world Nasdaq-100 data demonstrates that our approach consistently outperforms state-of-the-art CRL benchmarks for portfolio optimization.
- [131] arXiv:2404.10731 [ pdf , ps , html , other ]
-
Title: What is Meant by AGI? On the Definition of Artificial General IntelligenceSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper aims to establish a consensus on AGI's definition. General intelligence refers to the adaptation to open environments according to certain principles using limited resources. It emphasizes that adaptation or learning is an indispensable property of intelligence, and places the controversial part within the principles of intelligence, which can be described from different perspectives.
- [132] arXiv:2404.10733 [ pdf , ps , html , other ]
-
Title: Bootstrapping Linear Models for Fast Online Adaptation in Human-Agent CollaborationComments: 10 pages, 4 figures, Accepted to AAMAS 2024Subjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Abstract: Agents that assist people need to have well-initialized policies that can adapt quickly to align with their partners' reward functions. Initializing policies to maximize performance with unknown partners can be achieved by bootstrapping nonlinear models using imitation learning over large, offline datasets. Such policies can require prohibitive computation to fine-tune in-situ and therefore may miss critical run-time information about a partner's reward function as expressed through their immediate behavior. In contrast, online logistic regression using low-capacity models performs rapid inference and fine-tuning updates and thus can make effective use of immediate in-task behavior for reward function alignment. However, these low-capacity models cannot be bootstrapped as effectively by offline datasets and thus have poor initializations. We propose BLR-HAC, Bootstrapped Logistic Regression for Human Agent Collaboration, which bootstraps large nonlinear models to learn the parameters of a low-capacity model which then uses online logistic regression for updates during collaboration. We test BLR-HAC in a simulated surface rearrangement task and demonstrate that it achieves higher zero-shot accuracy than shallow methods and takes far less computation to adapt online while still achieving similar performance to fine-tuned, large nonlinear models. For code, please see our project page this https URL .
- [133] arXiv:2404.10740 [ pdf , ps , html , other ]
-
Title: N-Agent Ad Hoc TeamworkSubjects: Artificial Intelligence (cs.AI)
Abstract: Current approaches to learning cooperative behaviors in multi-agent settings assume relatively restrictive settings. In standard fully cooperative multi-agent reinforcement learning, the learning algorithm controls \textit{all} agents in the scenario, while in ad hoc teamwork, the learning algorithm usually assumes control over only a $\textit{single}$ agent in the scenario. However, many cooperative settings in the real world are much less restrictive. For example, in an autonomous driving scenario, a company might train its cars with the same learning algorithm, yet once on the road, these cars must cooperate with cars from another company. Towards generalizing the class of scenarios that cooperative learning methods can address, we introduce $N$-agent ad hoc teamwork, in which a set of autonomous agents must interact and cooperate with dynamically varying numbers and types of teammates at evaluation time. This paper formalizes the problem, and proposes the $\textit{Policy Optimization with Agent Modelling}$ (POAM) algorithm. POAM is a policy gradient, multi-agent reinforcement learning approach to the NAHT problem, that enables adaptation to diverse teammate behaviors by learning representations of teammate behaviors. Empirical evaluation on StarCraft II tasks shows that POAM improves cooperative task returns compared to baseline approaches, and enables out-of-distribution generalization to unseen teammates.
- [134] arXiv:2404.10763 [ pdf , ps , html , other ]
-
Title: LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?Yuchi Wang , Shuhuai Ren , Rundong Gao , Linli Yao , Qingyan Guo , Kaikai An , Jianhong Bai , Xu SunSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we identify the prior underperformance of diffusion models stemming from the absence of an effective latent space for image-text alignment, and the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation.
- [135] arXiv:2404.10883 [ pdf , ps , html , other ]
-
Title: Automated Discovery of Functional Actual Causes in Complex EnvironmentsSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Methodology (stat.ME)
Abstract: Reinforcement learning (RL) algorithms often struggle to learn policies that generalize to novel situations due to issues such as causal confusion, overfitting to irrelevant factors, and failure to isolate control of state factors. These issues stem from a common source: a failure to accurately identify and exploit state-specific causal relationships in the environment. While some prior works in RL aim to identify these relationships explicitly, they rely on informal domain-specific heuristics such as spatial and temporal proximity. Actual causality offers a principled and general framework for determining the causes of particular events. However, existing definitions of actual cause often attribute causality to a large number of events, even if many of them rarely influence the outcome. Prior work on actual causality proposes normality as a solution to this problem, but its existing implementations are challenging to scale to complex and continuous-valued RL environments. This paper introduces functional actual cause (FAC), a framework that uses context-specific independencies in the environment to restrict the set of actual causes. We additionally introduce Joint Optimization for Actual Cause Inference (JACI), an algorithm that learns from observational data to infer functional actual causes. We demonstrate empirically that FAC agrees with known results on a suite of examples from the actual causality literature, and JACI identifies actual causes with significantly higher accuracy than existing heuristic methods in a set of complex, continuous-valued environments.
- [136] arXiv:2404.10889 [ pdf , ps , other ]
-
Title: Cognitive-Motor Integration in Assessing Bimanual Motor SkillsComments: 12 pages, 3 figures, 2 tablesSubjects: Artificial Intelligence (cs.AI)
Abstract: Accurate assessment of bimanual motor skills is essential across various professions, yet, traditional methods often rely on subjective assessments or focus solely on motor actions, overlooking the integral role of cognitive processes. This study introduces a novel approach by leveraging deep neural networks (DNNs) to analyze and integrate both cognitive decision-making and motor execution. We tested this methodology by assessing laparoscopic surgery skills within the Fundamentals of Laparoscopic Surgery program, which is a prerequisite for general surgery certification. Utilizing video capture of motor actions and non-invasive functional near-infrared spectroscopy (fNIRS) for measuring neural activations, our approach precisely classifies subjects by expertise level and predicts FLS behavioral performance scores, significantly surpassing traditional single-modality assessments.
- [137] arXiv:2404.10890 [ pdf , ps , other ]
-
Title: Exploring Augmentation and Cognitive Strategies for AI based Synthetic PersonaeComments: This paper was accepted for publication: Proceedings of ACM Conf on Human Factors in Computing Systems (CHI 24), Rafael Arias Gonzalez, Steve DiPaola. Exploring Augmentation and Cognitive Strategies for Synthetic Personae. ACM SigCHI, in Challenges and Opportunities of LLM-Based Synthetic Personae and Data in HCI Workshop, 2024Subjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Abstract: Large language models (LLMs) hold potential for innovative HCI research, including the creation of synthetic personae. However, their black-box nature and propensity for hallucinations pose challenges. To address these limitations, this position paper advocates for using LLMs as data augmentation systems rather than zero-shot generators. We further propose the development of robust cognitive and memory frameworks to guide LLM responses. Initial explorations suggest that data enrichment, episodic memory, and self-reflection techniques can improve the reliability of synthetic personae and open up new avenues for HCI research.
- [138] arXiv:2404.10901 [ pdf , ps , html , other ]
-
Title: CrossGP: Cross-Day Glucose Prediction Excluding Physiological InformationSubjects: Artificial Intelligence (cs.AI)
Abstract: The increasing number of diabetic patients is a serious issue in society today, which has significant negative impacts on people's health and the country's financial expenditures. Because diabetes may develop into potential serious complications, early glucose prediction for diabetic patients is necessary for timely medical treatment. Existing glucose prediction methods typically utilize patients' private data (e.g. age, gender, ethnicity) and physiological parameters (e.g. blood pressure, heart rate) as reference features for glucose prediction, which inevitably leads to privacy protection concerns. Moreover, these models generally focus on either long-term (monthly-based) or short-term (minute-based) predictions. Long-term prediction methods are generally inaccurate because of the external uncertainties that can greatly affect the glucose values, while short-term ones fail to provide timely medical guidance. Based on the above issues, we propose CrossGP, a novel machine-learning framework for cross-day glucose prediction solely based on the patient's external activities without involving any physiological parameters. Meanwhile, we implement three baseline models for comparison. Extensive experiments on Anderson's dataset strongly demonstrate the superior performance of CrossGP and prove its potential for future real-life applications.
- [139] arXiv:2404.10906 [ pdf , ps , html , other ]
-
Title: Towards a Research Community in Interpretable Reinforcement Learning: the InterpPol WorkshopSubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Abstract: Embracing the pursuit of intrinsically explainable reinforcement learning raises crucial questions: what distinguishes explainability from interpretability? Should explainable and interpretable agents be developed outside of domains where transparency is imperative? What advantages do interpretable policies offer over neural networks? How can we rigorously define and measure interpretability in policies, without user studies? What reinforcement learning paradigms,are the most suited to develop interpretable agents? Can Markov Decision Processes integrate interpretable state representations? In addition to motivate an Interpretable RL community centered around the aforementioned questions, we propose the first venue dedicated to Interpretable RL: the InterpPol Workshop.
- [140] arXiv:2404.10907 [ pdf , ps , html , other ]
-
Title: Causal Effect Estimation Using Random Hyperplane TessellationsComments: At CLeaR 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: Matching is one of the simplest approaches for estimating causal effects from observational data. Matching techniques compare the observed outcomes across pairs of individuals with similar covariate values but different treatment statuses in order to estimate causal effects. However, traditional matching techniques are unreliable given high-dimensional covariates due to the infamous curse of dimensionality. To overcome this challenge, we propose a simple, fast, yet highly effective approach to matching using Random Hyperplane Tessellations (RHPT). First, we prove that the RHPT representation is an approximate balancing score -- thus maintaining the strong ignorability assumption -- and provide empirical evidence for this claim. Second, we report results of extensive experiments showing that matching using RHPT outperforms traditional matching techniques and is competitive with state-of-the-art deep learning methods for causal effect estimation. In addition, RHPT avoids the need for computationally expensive training of deep neural networks.
- [141] arXiv:2404.10933 [ pdf , ps , html , other ]
-
Title: LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMsTaeho Kim , Yanming Wang , Vatshank Chaturvedi , Lokesh Gupta , Seyeon Kim , Yongin Kwon , Sangtae HaComments: 9 pages, 9 figures, accepted to IJCAI 2024Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Fine-tuning pre-trained large language models (LLMs) with limited hardware presents challenges due to GPU memory constraints. Various distributed fine-tuning methods have been proposed to alleviate memory constraints on GPU. However, determining the most effective method for achieving rapid fine-tuning while preventing GPU out-of-memory issues in a given environment remains unclear. To address this challenge, we introduce LLMem, a solution that estimates the GPU memory consumption when applying distributed fine-tuning methods across multiple GPUs and identifies the optimal method. We conduct GPU memory usage estimation prior to fine-tuning, leveraging the fundamental structure of transformer-based decoder models and the memory usage distribution of each method. Experimental results show that LLMem accurately estimates peak GPU memory usage on a single GPU, with error rates of up to 1.6%. Additionally, it shows an average error rate of 3.0% when applying distributed fine-tuning methods to LLMs with more than a billion parameters on multi-GPU setups.
- [142] arXiv:2404.10991 [ pdf , ps , other ]
-
Title: Function Approximation for Reinforcement Learning Controller for Energy from Spread WavesSoumyendu Sarkar , Vineet Gundecha , Sahand Ghorbanpour , Alexander Shmakov , Ashwin Ramesh Babu , Avisek Naug , Alexandre Pichard , Mathieu CochoComments: IJCAI 2023, Proceedings of the Thirty-Second International Joint Conference on Artificial IntelligenceAugust 2023Journal-ref: IJCAI 2023, Proceedings of the Thirty-Second International Joint Conference on Artificial IntelligenceAugust 2023, Article No 688, Pages 6201 to 6209Subjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract: The industrial multi-generator Wave Energy Converters (WEC) must handle multiple simultaneous waves coming from different directions called spread waves. These complex devices in challenging circumstances need controllers with multiple objectives of energy capture efficiency, reduction of structural stress to limit maintenance, and proactive protection against high waves. The Multi-Agent Reinforcement Learning (MARL) controller trained with the Proximal Policy Optimization (PPO) algorithm can handle these complexities. In this paper, we explore different function approximations for the policy and critic networks in modeling the sequential nature of the system dynamics and find that they are key to better performance. We investigated the performance of a fully connected neural network (FCN), LSTM, and Transformer model variants with varying depths and gated residual connections. Our results show that the transformer model of moderate depth with gated residual connections around the multi-head attention, multi-layer perceptron, and the transformer block (STrXL) proposed in this paper is optimal and boosts energy efficiency by an average of 22.1% for these complex spread waves over the existing spring damper (SD) controller. Furthermore, unlike the default SD controller, the transformer controller almost eliminated the mechanical stress from the rotational yaw motion for angled waves. Demo: this https URL
- [143] arXiv:2404.11027 [ pdf , ps , html , other ]
-
Title: Empowering Large Language Models on Robotic Manipulation with Affordance PromptingSubjects: Artificial Intelligence (cs.AI)
Abstract: While large language models (LLMs) are successful in completing various language processing tasks, they easily fail to interact with the physical world by generating control sequences properly. We find that the main reason is that LLMs are not grounded in the physical world. Existing LLM-based approaches circumvent this problem by relying on additional pre-defined skills or pre-trained sub-policies, making it hard to adapt to new tasks. In contrast, we aim to address this problem and explore the possibility to prompt pre-trained LLMs to accomplish a series of robotic manipulation tasks in a training-free paradigm. Accordingly, we propose a framework called LLM+A(ffordance) where the LLM serves as both the sub-task planner (that generates high-level plans) and the motion controller (that generates low-level control sequences). To ground these plans and control sequences on the physical world, we develop the affordance prompting technique that stimulates the LLM to 1) predict the consequences of generated plans and 2) generate affordance values for relevant objects. Empirically, we evaluate the effectiveness of LLM+A in various language-conditioned robotic manipulation tasks, which show that our approach substantially improves performance by enhancing the feasibility of generated plans and control and can easily generalize to different environments.
- [144] arXiv:2404.11041 [ pdf , ps , html , other ]
-
Title: On the Empirical Complexity of Reasoning and Planning in LLMsSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs) work surprisingly well for some complex reasoning problems via chain-of-thought (CoT) or tree-of-thought (ToT), but the underlying reasons remain unclear. We seek to understand the performance of these methods by conducting experimental case studies and linking the outcomes to sample and computational complexity in machine learning. We found that if problems can be decomposed into a sequence of reasoning steps and learning to predict the next step has a low sample and computational complexity, explicitly outlining the reasoning chain with all necessary information for predicting the next step may improve performance. Conversely, for problems where predicting the next step is computationally hard, adopting ToT may yield better reasoning outcomes than attempting to formulate a short reasoning chain.
- [145] arXiv:2404.11046 [ pdf , ps , html , other ]
-
Title: Lightweight Unsupervised Federated Learning with Pretrained Vision Language ModelSubjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Federated learning aims to tackle the ``isolated data island" problem, where it trains a collective model from physically isolated clients while safeguarding the privacy of users' data. However, supervised federated learning necessitates that each client labels their data for training, which can be both time-consuming and resource-intensive, and may even be impractical for edge devices. Moreover, the training and transmission of deep models present challenges to the computation and communication capabilities of the clients. To address these two inherent challenges in supervised federated learning, we propose a novel lightweight unsupervised federated learning approach that leverages unlabeled data on each client to perform lightweight model training and communication by harnessing pretrained vision-language models, such as CLIP. By capitalizing on the zero-shot prediction capability and the well-trained image encoder of the pre-trained CLIP model, we have carefully crafted an efficient and resilient self-training approach. This method refines the initial zero-shot predicted pseudo-labels of unlabeled instances through the sole training of a linear classifier on top of the fixed image encoder. Additionally, to address data heterogeneity within each client, we propose a class-balanced text feature sampling strategy for generating synthetic instances in the feature space to support local training. Experiments are conducted on multiple benchmark datasets. The experimental results demonstrate that our proposed method greatly enhances model performance in comparison to CLIP's zero-shot predictions and even outperforms supervised federated learning benchmark methods given limited computational and communication overhead.
- [146] arXiv:2404.11122 [ pdf , ps , html , other ]
-
Title: Small Language Models are Good Too: An Empirical Study of Zero-Shot ClassificationPierre Lepagnol (LISN), Thomas Gerald (LISN), Sahar Ghannay (LISN), Christophe Servan (STL, ILES), Sophie Rosset (LISN)Journal-ref: LREC-COLING 2024, May 2024, TURIN, ItalySubjects: Artificial Intelligence (cs.AI)
Abstract: This study is part of the debate on the efficiency of large versus small language models for text classification by prompting.We assess the performance of small language models in zero-shot text classification, challenging the prevailing dominance of large models.Across 15 datasets, our investigation benchmarks language models from 77M to 40B parameters using different architectures and scoring functions. Our findings reveal that small models can effectively classify texts, getting on par with or surpassing their larger counterparts.We developed and shared a comprehensive open-source repository that encapsulates our methodologies. This research underscores the notion that bigger isn't always better, suggesting that resource-efficient small models may offer viable solutions for specific data classification challenges.
- [147] arXiv:2404.11144 [ pdf , ps , html , other ]
-
Title: Self-adaptive PSRO: Towards an Automatic Population-based Game SolverComments: Accepted to 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024)Subjects: Artificial Intelligence (cs.AI) ; Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Abstract: Policy-Space Response Oracles (PSRO) as a general algorithmic framework has achieved state-of-the-art performance in learning equilibrium policies of two-player zero-sum games. However, the hand-crafted hyperparameter value selection in most of the existing works requires extensive domain knowledge, forming the main barrier to applying PSRO to different games. In this work, we make the first attempt to investigate the possibility of self-adaptively determining the optimal hyperparameter values in the PSRO framework. Our contributions are three-fold: (1) Using several hyperparameters, we propose a parametric PSRO that unifies the gradient descent ascent (GDA) and different PSRO variants. (2) We propose the self-adaptive PSRO (SPSRO) by casting the hyperparameter value selection of the parametric PSRO as a hyperparameter optimization (HPO) problem where our objective is to learn an HPO policy that can self-adaptively determine the optimal hyperparameter values during the running of the parametric PSRO. (3) To overcome the poor performance of online HPO methods, we propose a novel offline HPO approach to optimize the HPO policy based on the Transformer architecture. Experiments on various two-player zero-sum games demonstrate the superiority of SPSRO over different baselines.
- [148] arXiv:2404.11148 [ pdf , ps , html , other ]
-
Title: Explainable Machine Learning System for Predicting Chronic Kidney Disease in High-Risk Cardiovascular PatientsComments: 9 pages, 5 figuresSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: As the global population ages, the incidence of Chronic Kidney Disease (CKD) is rising. CKD often remains asymptomatic until advanced stages, which significantly burdens both the healthcare system and patient quality of life. This research developed an explainable machine learning system for predicting CKD in patients with cardiovascular risks, utilizing medical history and laboratory data. The Random Forest model achieved the highest sensitivity of 88.2%. The study introduces a comprehensive explainability framework that extends beyond traditional feature importance methods, incorporating global and local interpretations, bias inspection, biomedical relevance, and safety assessments. Key predictive features identified in global interpretation were the use of diabetic and ACEI/ARB medications, and initial eGFR values. Local interpretation provided model insights through counterfactual explanations, which aligned with other system parts. After conducting a bias inspection, it was found that the initial eGFR values and CKD predictions exhibited some bias, but no significant gender bias was identified. The model's logic, extracted by scoped rules, was confirmed to align with existing medical literature. The safety assessment tested potentially dangerous cases and confirmed that the model behaved safely. This system enhances the explainability, reliability, and accountability of the model, promoting its potential integration into healthcare settings and compliance with upcoming regulatory standards, and showing promise for broader applications in healthcare machine learning.
- [149] arXiv:2404.11160 [ pdf , ps , html , other ]
-
Title: Low-Cost Language Models: Survey and Performance Evaluation on Python Code GenerationJessica López Espejel , Mahaman Sanoussi Yahaya Alassan , Merieme Bouhandi , Walid Dahhane , El Hassane EttifouriComments: Under review at Elsevier's Engineering Applications of Artificial IntelligenceSubjects: Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have become the go-to solution for many Natural Language Processing (NLP) tasks due to their ability to tackle various problems and produce high-quality results. Specifically, they are increasingly used to automatically generate code, easing the burden on developers by handling repetitive tasks. However, this improvement in quality has led to high computational and memory demands, making LLMs inaccessible to users with limited resources. In this paper, we focus on Central Processing Unit (CPU)-compatible models and conduct a thorough semi-manual evaluation of their strengths and weaknesses in generating Python code. We enhance their performance by introducing a Chain-of-Thought prompt that guides the model in problem-solving. Additionally, we propose a dataset of 60 programming problems with varying difficulty levels for evaluation purposes. Our assessment also includes testing these models on two state-of-the-art datasets: HumanEval and EvalPlus. We commit to sharing our dataset and experimental results publicly to ensure transparency.
- [150] arXiv:2404.11208 [ pdf , ps , html , other ]
-
Title: CAGE: Causality-Aware Shapley Value for Global ExplanationsSubjects: Artificial Intelligence (cs.AI)
Abstract: As Artificial Intelligence (AI) is having more influence on our everyday lives, it becomes important that AI-based decisions are transparent and explainable. As a consequence, the field of eXplainable AI (or XAI) has become popular in recent years. One way to explain AI models is to elucidate the predictive importance of the input features for the AI model in general, also referred to as global explanations. Inspired by cooperative game theory, Shapley values offer a convenient way for quantifying the feature importance as explanations. However many methods based on Shapley values are built on the assumption of feature independence and often overlook causal relations of the features which could impact their importance for the ML model. Inspired by studies of explanations at the local level, we propose CAGE (Causally-Aware Shapley Values for Global Explanations). In particular, we introduce a novel sampling procedure for out-coalition features that respects the causal relations of the input features. We derive a practical approach that incorporates causal knowledge into global explanation and offers the possibility to interpret the predictive feature importance considering their causal relation. We evaluate our method on synthetic data and real-world data. The explanations from our approach suggest that they are not only more intuitive but also more faithful compared to previous global explanation methods.
- [151] arXiv:2404.11209 [ pdf , ps , html , other ]
-
Title: Prompt-Guided Generation of Structured Chest X-Ray Report Using a Pre-trained LLMComments: Accepted by IEEE Conference on Multimedia Expo 2024Subjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Abstract: Medical report generation automates radiology descriptions from images, easing the burden on physicians and minimizing errors. However, current methods lack structured outputs and physician interactivity for clear, clinically relevant reports. Our method introduces a prompt-guided approach to generate structured chest X-ray reports using a pre-trained large language model (LLM). First, we identify anatomical regions in chest X-rays to generate focused sentences that center on key visual elements, thereby establishing a structured report foundation with anatomy-based sentences. We also convert the detected anatomy into textual prompts conveying anatomical comprehension to the LLM. Additionally, the clinical context prompts guide the LLM to emphasize interactivity and clinical requirements. By integrating anatomy-focused sentences and anatomy/clinical prompts, the pre-trained LLM can generate structured chest X-ray reports tailored to prompted anatomical regions and clinical contexts. We evaluate using language generation and clinical effectiveness metrics, demonstrating strong performance.
- [152] arXiv:2404.11276 [ pdf , ps , html , other ]
-
Title: RD2Bench: Toward Data-Centric Automatic R&DComments: 17 pages, 5 figures,Subjects: Artificial Intelligence (cs.AI) ; General Finance (q-fin.GN)
Abstract: The progress of humanity is driven by those successful discoveries accompanied by countless failed experiments. Researchers often seek the potential research directions by reading and then verifying them through experiments. The process imposes a significant burden on researchers. In the past decade, the data-driven black-box deep learning method demonstrates its effectiveness in a wide range of real-world scenarios, which exacerbates the experimental burden of researchers and thus renders the potential successful discoveries veiled. Therefore, automating such a research and development (R&D) process is an urgent need. In this paper, we serve as the first effort to formalize the goal by proposing a Real-world Data-centric automatic R&D Benchmark, namely RD2Bench. RD2Bench benchmarks all the operations in data-centric automatic R&D (D-CARD) as a whole to navigate future work toward our goal directly. We focuses on evaluating the interaction and synergistic effects of various model capabilities and aiding to select the well-performed trustworthy models. Although RD2Bench is very challenging to the state-of-the-art (SOTA) large language model (LLM) named GPT-4, indicating ample research opportunities and more research efforts, LLMs possess promising potential to bring more significant development to D-CARD: They are able to implement some simple methods without adopting any additional techniques. We appeal to future work to take developing techniques for tackling automatic R&D into consideration, thus bringing the opportunities of the potential revolutionary upgrade to human productivity.
- [153] arXiv:2404.11290 [ pdf , ps , html , other ]
-
Title: Inductive Cognitive Diagnosis for Fast Student Learning in Web-Based Online Intelligent Education SystemsComments: WWW 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: Cognitive diagnosis aims to gauge students' mastery levels based on their response logs. Serving as a pivotal module in web-based online intelligent education systems (WOIESs), it plays an upstream and fundamental role in downstream tasks like learning item recommendation and computerized adaptive testing. WOIESs are open learning environment where numerous new students constantly register and complete exercises. In WOIESs, efficient cognitive diagnosis is crucial to fast feedback and accelerating student learning. However, the existing cognitive diagnosis methods always employ intrinsically transductive student-specific embeddings, which become slow and costly due to retraining when dealing with new students who are unseen during training. To this end, this paper proposes an inductive cognitive diagnosis model (ICDM) for fast new students' mastery levels inference in WOIESs. Specifically, in ICDM, we propose a novel student-centered graph (SCG). Rather than inferring mastery levels through updating student-specific embedding, we derive the inductive mastery levels as the aggregated outcomes of students' neighbors in SCG. Namely, SCG enables to shift the task from finding the most suitable student-specific embedding that fits the response logs to finding the most suitable representations for different node types in SCG, and the latter is more efficient since it no longer requires retraining. To obtain this representation, ICDM consists of a construction-aggregation-generation-transformation process to learn the final representation of students, exercises and concepts. Extensive experiments across real-world datasets show that, compared with the existing cognitive diagnosis methods that are always transductive, ICDM is much more faster while maintains the competitive inference performance for new students.
- [154] arXiv:2404.11296 [ pdf , ps , html , other ]
-
Title: How to Exhibit More Predictable BehaviorsComments: 10 pages, 7 figures, 2 tablesSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper looks at predictability problems, i.e., wherein an agent must choose its strategy in order to optimize the predictions that an external observer could make. We address these problems while taking into account uncertainties on the environment dynamics and on the observed agent's policy. To that end, we assume that the observer 1. seeks to predict the agent's future action or state at each time step, and 2. models the agent using a stochastic policy computed from a known underlying problem, and we leverage on the framework of observer-aware Markov decision processes (OAMDPs). We propose action and state predictability performance criteria through reward functions built on the observer's belief about the agent policy; show that these induced predictable OAMDPs can be represented by goal-oriented or discounted MDPs; and analyze the properties of the proposed reward functions both theoretically and empirically on two types of grid-world problems.
- [155] arXiv:2404.11341 [ pdf , ps , html , other ]
-
Title: The Causal Chambers: Real Physical Systems as a Testbed for AI MethodologyComments: 38 pages, 17 figuresSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Abstract: In some fields of AI, machine learning and statistics, the validation of new methods and algorithms is often hindered by the scarcity of suitable real-world datasets. Researchers must often turn to simulated data, which yields limited information about the applicability of the proposed methods to real problems. As a step forward, we have constructed two devices that allow us to quickly and inexpensively produce large datasets from non-trivial but well-understood physical systems. The devices, which we call causal chambers, are computer-controlled laboratories that allow us to manipulate and measure an array of variables from these physical systems, providing a rich testbed for algorithms from a variety of fields. We illustrate potential applications through a series of case studies in fields such as causal discovery, out-of-distribution generalization, change point detection, independent component analysis, and symbolic regression. For applications to causal inference, the chambers allow us to carefully perform interventions. We also provide and empirically validate a causal model of each chamber, which can be used as ground truth for different tasks. All hardware and software is made open source, and the datasets are publicly available at this http URL or through the Python package causalchamber.
- [156] arXiv:2404.11408 [ pdf , ps , html , other ]
-
Title: DUPE: Detection Undermining via Prompt Engineering for Deepfake TextComments: 10 pages, 2 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: As large language models (LLMs) become increasingly commonplace, concern about distinguishing between human and AI text increases as well. The growing power of these models is of particular concern to teachers, who may worry that students will use LLMs to write school assignments. Facing a technology with which they are unfamiliar, teachers may turn to publicly-available AI text detectors. Yet the accuracy of many of these detectors has not been thoroughly verified, posing potential harm to students who are falsely accused of academic dishonesty. In this paper, we evaluate three different AI text detectors-Kirchenbauer et al. watermarks, ZeroGPT, and GPTZero-against human and AI-generated essays. We find that watermarking results in a high false positive rate, and that ZeroGPT has both high false positive and false negative rates. Further, we are able to significantly increase the false negative rate of all detectors by using ChatGPT 3.5 to paraphrase the original AI-generated texts, thereby effectively bypassing the detectors.
- [157] arXiv:2404.11431 [ pdf , ps , html , other ]
-
Title: Instantiations and Computational Aspects of Non-Flat Assumption-based ArgumentationSubjects: Artificial Intelligence (cs.AI)
Abstract: Most existing computational tools for assumption-based argumentation (ABA) focus on so-called flat frameworks, disregarding the more general case. In this paper, we study an instantiation-based approach for reasoning in possibly non-flat ABA. We make use of a semantics-preserving translation between ABA and bipolar argumentation frameworks (BAFs). By utilizing compilability theory, we establish that the constructed BAFs will in general be of exponential size. In order to keep the number of arguments and computational cost low, we present three ways of identifying redundant arguments. Moreover, we identify fragments of ABA which admit a poly-sized instantiation. We propose two algorithmic approaches for reasoning in possibly non-flat ABA. The first approach utilizes the BAF instantiation while the second works directly without constructing arguments. An empirical evaluation shows that the former outperforms the latter on many instances, reflecting the lower complexity of BAF reasoning. This result is in contrast to flat ABA, where direct approaches dominate instantiation-based approaches.
- [158] arXiv:2404.11443 [ pdf , ps , other ]
-
Title: Prediction of Unmanned Surface Vessel Motion Attitude Based on CEEMDAN-PSO-SVMSubjects: Artificial Intelligence (cs.AI)
Abstract: Unmanned boats, while navigating at sea, utilize active compensation systems to mitigate wave disturbances experienced by onboard instruments and equipment. However, there exists a lag in the measurement of unmanned boat attitudes, thus introducing unmanned boat motion attitude prediction to compensate for the lag in the signal acquisition process. This paper, based on the basic principles of waves, derives the disturbance patterns of waves on unmanned boats from the wave energy spectrum. Through simulation analysis of unmanned boat motion attitudes, motion attitude data is obtained, providing experimental data for subsequent work. A combined prediction model based on Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN), Particle Swarm Optimization (PSO), and Support Vector Machine (SVM) is designed to predict the motion attitude of unmanned boats. Simulation results validate its superior prediction accuracy compared to traditional prediction models. For example, in terms of mean absolute error, it improves by 17% compared to the EMD-PSO-SVM model.
- [159] arXiv:2404.11447 [ pdf , ps , other ]
-
Title: Research on emotionally intelligent dialogue generation based on automatic dialogue systemSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Automated dialogue systems are important applications of artificial intelligence, and traditional systems struggle to understand user emotions and provide empathetic feedback. This study integrates emotional intelligence technology into automated dialogue systems and creates a dialogue generation model with emotional intelligence through deep learning and natural language processing techniques. The model can detect and understand a wide range of emotions and specific pain signals in real time, enabling the system to provide empathetic interaction. By integrating the results of the study "Can artificial intelligence detect pain and express pain empathy?", the model's ability to understand the subtle elements of pain empathy has been enhanced, setting higher standards for emotional intelligence dialogue systems. The project aims to provide theoretical understanding and practical suggestions to integrate advanced emotional intelligence capabilities into dialogue systems, thereby improving user experience and interaction quality.
- [160] arXiv:2404.11458 [ pdf , ps , html , other ]
-
Title: Learn to Tour: Operator Design For Solution Feasibility Mapping in Pickup-and-delivery Traveling Salesman ProblemSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper aims to develop a learning method for a special class of traveling salesman problems (TSP), namely, the pickup-and-delivery TSP (PDTSP), which finds the shortest tour along a sequence of one-to-one pickup-and-delivery nodes. One-to-one here means that the transported people or goods are associated with designated pairs of pickup and delivery nodes, in contrast to that indistinguishable goods can be delivered to any nodes. In PDTSP, precedence constraints need to be satisfied that each pickup node must be visited before its corresponding delivery node. Classic operations research (OR) algorithms for PDTSP are difficult to scale to large-sized problems. Recently, reinforcement learning (RL) has been applied to TSPs. The basic idea is to explore and evaluate visiting sequences in a solution space. However, this approach could be less computationally efficient, as it has to potentially evaluate many infeasible solutions of which precedence constraints are violated. To restrict solution search within a feasible space, we utilize operators that always map one feasible solution to another, without spending time exploring the infeasible solution space. Such operators are evaluated and selected as policies to solve PDTSPs in an RL framework. We make a comparison of our method and baselines, including classic OR algorithms and existing learning methods. Results show that our approach can find tours shorter than baselines.
- [161] arXiv:2404.11476 [ pdf , ps , html , other ]
-
Title: Taxonomy to Regulation: A (Geo)Political Taxonomy for AI Risks and Regulatory Measures in the EU AI ActSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Technological innovations have shown remarkable capabilities to benefit and harm society alike. AI constitutes a democratized sophisticated technology accessible to large parts of society, including malicious actors. This work proposes a taxonomy focusing on on (geo)political risks associated with AI. It identifies 12 risks in total divided into four categories: (1) Geopolitical Pressures, (2) Malicious Usage, (3) Environmental, Social, and Ethical Risks, and (4) Privacy and Trust Violations. Incorporating a regulatory side, this paper conducts a policy assessment of the EU AI Act. Adopted in March 2023, the landmark regulation has the potential to have a positive top-down impact concerning AI risk reduction but needs regulatory adjustments to mitigate risks more comprehensively. Regulatory exceptions for open-source models, excessively high parameters for the classification of GPAI models as a systemic risk, and the exclusion of systems designed exclusively for military purposes from the regulation's obligations leave room for future action.
- [162] arXiv:2404.11483 [ pdf , ps , html , other ]
-
Title: AgentKit: Flow Engineering with Graphs, not CodingYue Wu , Yewen Fan , So Yeon Min , Shrimai Prabhumoye , Stephen McAleer , Yonatan Bisk , Ruslan Salakhutdinov , Yuanzhi Li , Tom MitchellSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: We propose an intuitive LLM prompting framework (AgentKit) for multifunctional agents. AgentKit offers a unified framework for explicitly constructing a complex "thought process" from simple natural language prompts. The basic building block in AgentKit is a node, containing a natural language prompt for a specific subtask. The user then puts together chains of nodes, like stacking LEGO pieces. The chains of nodes can be designed to explicitly enforce a naturally structured "thought process". For example, for the task of writing a paper, one may start with the thought process of 1) identify a core message, 2) identify prior research gaps, etc. The nodes in AgentKit can be designed and combined in different ways to implement multiple advanced capabilities including on-the-fly hierarchical planning, reflection, and learning from interactions. In addition, due to the modular nature and the intuitive design to simulate explicit human thought process, a basic agent could be implemented as simple as a list of prompts for the subtasks and therefore could be designed and tuned by someone without any programming experience. Quantitatively, we show that agents designed through AgentKit achieve SOTA performance on WebShop and Crafter. These advances underscore AgentKit's potential in making LLM agents effective and accessible for a wider range of applications. this https URL
- [163] arXiv:2404.11515 [ pdf , ps , other ]
-
Title: Embedding Privacy in Computational Social Science and Artificial Intelligence ResearchComments: 2024Subjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Abstract: Privacy is a human right. It ensures that individuals are free to engage in discussions, participate in groups, and form relationships online or offline without fear of their data being inappropriately harvested, analyzed, or otherwise used to harm them. Preserving privacy has emerged as a critical factor in research, particularly in the computational social science (CSS), artificial intelligence (AI) and data science domains, given their reliance on individuals' data for novel insights. The increasing use of advanced computational models stands to exacerbate privacy concerns because, if inappropriately used, they can quickly infringe privacy rights and lead to adverse effects for individuals - especially vulnerable groups - and society. We have already witnessed a host of privacy issues emerge with the advent of large language models (LLMs), such as ChatGPT, which further demonstrate the importance of embedding privacy from the start. This article contributes to the field by discussing the role of privacy and the primary issues that researchers working in CSS, AI, data science and related domains are likely to face. It then presents several key considerations for researchers to ensure participant privacy is best preserved in their research design, data collection and use, analysis, and dissemination of research results.
- [164] arXiv:2404.11581 [ pdf , ps , html , other ]
-
Title: LLMTune: Accelerate Database Knob Tuning with Large Language ModelsXinmei Huang , Haoyang Li , Jing Zhang , Xinxin Zhao , Zhiming Yao , Yiyan Li , Zhuohao Yu , Tieying Zhang , Hong Chen , Cuiping LiSubjects: Artificial Intelligence (cs.AI) ; Databases (cs.DB)
Abstract: Database knob tuning is a critical challenge in the database community, aiming to optimize knob values to enhance database performance for specific workloads. DBMS often feature hundreds of tunable knobs, posing a significant challenge for DBAs to recommend optimal configurations. Consequently, many machine learning-based tuning methods have been developed to automate this process. Despite the introduction of various optimizers, practical applications have unveiled a new problem: they typically require numerous workload runs to achieve satisfactory performance, a process that is both time-consuming and resource-intensive. This inefficiency largely stems from the optimal configuration often being substantially different from the default setting, necessitating multiple iterations during tuning. Recognizing this, we argue that an effective starting point could significantly reduce redundant exploration in less efficient areas, thereby potentially speeding up the tuning process for the optimizers. Based on this assumption, we introduce LLMTune, a large language model-based configuration generator designed to produce an initial, high-quality configuration for new workloads. These generated configurations can then serve as starting points for various base optimizers, accelerating their tuning processes. To obtain training data for LLMTune's supervised fine-tuning, we have devised a new automatic data generation framework capable of efficiently creating a large number of <workload, configuration> pairs. We have conducted thorough experiments to evaluate LLMTune's effectiveness with different workloads, such as TPC-H and JOB. In comparison to leading methods, LLMTune demonstrates a quicker ability to identify superior configurations. For instance, with the challenging TPC-H workload, our LLMTune achieves a significant 15.6x speed-up ratio in finding the best-performing configurations.
- [165] arXiv:2404.11584 [ pdf , ps , html , other ]
-
Title: The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A SurveyComments: 13 pages,6 figures,38 referencesSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: This survey paper examines the recent advancements in AI agent implementations, with a focus on their ability to achieve complex goals that require enhanced reasoning, planning, and tool execution capabilities. The primary objectives of this work are to a) communicate the current capabilities and limitations of existing AI agent implementations, b) share insights gained from our observations of these systems in action, and c) suggest important considerations for future developments in AI agent design. We achieve this by providing overviews of single-agent and multi-agent architectures, identifying key patterns and divergences in design choices, and evaluating their overall impact on accomplishing a provided goal. Our contribution outlines key themes when selecting an agentic architecture, the impact of leadership on agent systems, agent communication styles, and key phases for planning, execution, and reflection that enable robust AI agent systems.
- [166] arXiv:2404.11585 [ pdf , ps , html , other ]
-
Title: Spatial Context-based Self-Supervised Learning for Handwritten Text RecognitionSubjects: Artificial Intelligence (cs.AI)
Abstract: Handwritten Text Recognition (HTR) is a relevant problem in computer vision, and implies unique challenges owing to its inherent variability and the rich contextualization required for its interpretation. Despite the success of Self-Supervised Learning (SSL) in computer vision, its application to HTR has been rather scattered, leaving key SSL methodologies unexplored. This work focuses on one of them, namely Spatial Context-based SSL. We investigate how this family of approaches can be adapted and optimized for HTR and propose new workflows that leverage the unique features of handwritten text. Our experiments demonstrate that the methods considered lead to advancements in the state-of-the-art of SSL for HTR in a number of benchmark cases.
- [167] arXiv:2404.11597 [ pdf , ps , other ]
-
Title: Explainable Artificial Intelligence Techniques for Accurate Fault Detection and Diagnosis: A ReviewSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: As the manufacturing industry advances with sensor integration and automation, the opaque nature of deep learning models in machine learning poses a significant challenge for fault detection and diagnosis. And despite the related predictive insights Artificial Intelligence (AI) can deliver, advanced machine learning engines often remain a black box. This paper reviews the eXplainable AI (XAI) tools and techniques in this context. We explore various XAI methodologies, focusing on their role in making AI decision-making transparent, particularly in critical scenarios where humans are involved. We also discuss current limitations and potential future research that aims to balance explainability with model performance while improving trustworthiness in the context of AI applications for critical industrial use cases.
- [168] arXiv:2404.11677 [ pdf , ps , html , other ]
-
Title: Cross-Problem Learning for Solving Vehicle Routing ProblemsZhuoyi Lin , Yaoxin Wu , Bangjian Zhou , Zhiguang Cao , Wen Song , Yingqian Zhang , Senthilnath JayaveluSubjects: Artificial Intelligence (cs.AI)
Abstract: Existing neural heuristics often train a deep architecture from scratch for each specific vehicle routing problem (VRP), ignoring the transferable knowledge across different VRP variants. This paper proposes the cross-problem learning to assist heuristics training for different downstream VRP variants. Particularly, we modularize neural architectures for complex VRPs into 1) the backbone Transformer for tackling the travelling salesman problem (TSP), and 2) the additional lightweight modules for processing problem-specific features in complex VRPs. Accordingly, we propose to pre-train the backbone Transformer for TSP, and then apply it in the process of fine-tuning the Transformer models for each target VRP variant. On the one hand, we fully fine-tune the trained backbone Transformer and problem-specific modules simultaneously. On the other hand, we only fine-tune small adapter networks along with the modules, keeping the backbone Transformer still. Extensive experiments on typical VRPs substantiate that 1) the full fine-tuning achieves significantly better performance than the one trained from scratch, and 2) the adapter-based fine-tuning also delivers comparable performance while being notably parameter-efficient. Furthermore, we empirically demonstrate the favorable effect of our method in terms of cross-distribution application and versatility.
- [169] arXiv:2404.11698 [ pdf , ps , html , other ]
-
Title: A Secure and Trustworthy Network Architecture for Federated Learning Healthcare ApplicationsAntonio Boiano , Marco Di Gennaro , Luca Barbieri , Michele Carminati , Monica Nicoli , Alessandro Redondi , Stefano Savazzi , Albert Sund Aillet , Diogo Reis Santos , Luigi SerioSubjects: Artificial Intelligence (cs.AI) ; Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Federated Learning (FL) has emerged as a promising approach for privacy-preserving machine learning, particularly in sensitive domains such as healthcare. In this context, the TRUSTroke project aims to leverage FL to assist clinicians in ischemic stroke prediction. This paper provides an overview of the TRUSTroke FL network infrastructure. The proposed architecture adopts a client-server model with a central Parameter Server (PS). We introduce a Docker-based design for the client nodes, offering a flexible solution for implementing FL processes in clinical settings. The impact of different communication protocols (HTTP or MQTT) on FL network operation is analyzed, with MQTT selected for its suitability in FL scenarios. A control plane to support the main operations required by FL processes is also proposed. The paper concludes with an analysis of security aspects of the FL architecture, addressing potential threats and proposing mitigation strategies to increase the trustworthiness level.
- [170] arXiv:2404.11706 [ pdf , ps , html , other ]
-
Title: Pretraining Billion-scale Geospatial Foundational Models on FrontierSubjects: Artificial Intelligence (cs.AI)
Abstract: As AI workloads increase in scope, generalization capability becomes challenging for small task-specific models and their demand for large amounts of labeled training samples increases. On the contrary, Foundation Models (FMs) are trained with internet-scale unlabeled data via self-supervised learning and have been shown to adapt to various tasks with minimal fine-tuning. Although large FMs have demonstrated significant impact in natural language processing and computer vision, efforts toward FMs for geospatial applications have been restricted to smaller size models, as pretraining larger models requires very large computing resources equipped with state-of-the-art hardware accelerators. Current satellite constellations collect 100+TBs of data a day, resulting in images that are billions of pixels and multimodal in nature. Such geospatial data poses unique challenges opening up new opportunities to develop FMs. We investigate billion scale FMs and HPC training profiles for geospatial applications by pretraining on publicly available data. We studied from end-to-end the performance and impact in the solution by scaling the model size. Our larger 3B parameter size model achieves up to 30% improvement in top1 scene classification accuracy when comparing a 100M parameter model. Moreover, we detail performance experiments on the Frontier supercomputer, America's first exascale system, where we study different model and data parallel approaches using PyTorch's Fully Sharded Data Parallel library. Specifically, we study variants of the Vision Transformer architecture (ViT), conducting performance analysis for ViT models with size up to 15B parameters. By discussing throughput and performance bottlenecks under different parallelism configurations, we offer insights on how to leverage such leadership-class HPC resources when developing large models for geospatial imagery applications.
- [171] arXiv:2404.11714 [ pdf , ps , other ]
-
Title: Implementation and Evaluation of a Gradient Descent-Trained Defensible Blackboard Architecture SystemSubjects: Artificial Intelligence (cs.AI)
Abstract: A variety of forms of artificial intelligence systems have been developed. Two well-known techniques are neural networks and rule-fact expert systems. The former can be trained from presented data while the latter is typically developed by human domain experts. A combined implementation that uses gradient descent to train a rule-fact expert system has been previously proposed. A related system type, the Blackboard Architecture, adds an actualization capability to expert systems. This paper proposes and evaluates the incorporation of a defensible-style gradient descent training capability into the Blackboard Architecture. It also introduces the use of activation functions for defensible artificial intelligence systems and implements and evaluates a new best path-based training algorithm.
- [172] arXiv:2404.11716 [ pdf , ps , html , other ]
-
Title: A Survey on Semantic Modeling for Building Energy ManagementComments: 29 pages, 6 figures, 2 tablesSubjects: Artificial Intelligence (cs.AI)
Abstract: Buildings account for a substantial portion of global energy consumption. Reducing buildings' energy usage primarily involves obtaining data from building systems and environment, which are instrumental in assessing and optimizing the building's performance. However, as devices from various manufacturers represent their data in unique ways, this disparity introduces challenges for semantic interoperability and creates obstacles in developing scalable building applications. This survey explores the leading semantic modeling techniques deployed for energy management in buildings. Furthermore, it aims to offer tangible use cases for applying semantic models, shedding light on the pivotal concepts and limitations intrinsic to each model. Our findings will assist researchers in discerning the appropriate circumstances and methodologies for employing these models in various use cases.
- [173] arXiv:2404.11720 [ pdf , ps , html , other ]
-
Title: GEOBIND: Binding Text, Image, and Audio through Satellite ImagesComments: 2024 IEEE International Geoscience and Remote Sensing SymposiumSubjects: Artificial Intelligence (cs.AI)
Abstract: In remote sensing, we are interested in modeling various modalities for some geographic location. Several works have focused on learning the relationship between a location and type of landscape, habitability, audio, textual descriptions, etc. Recently, a common way to approach these problems is to train a deep-learning model that uses satellite images to infer some unique characteristics of the location. In this work, we present a deep-learning model, GeoBind, that can infer about multiple modalities, specifically text, image, and audio, from satellite imagery of a location. To do this, we use satellite images as the binding element and contrastively align all other modalities to the satellite image data. Our training results in a joint embedding space with multiple types of data: satellite image, ground-level image, audio, and text. Furthermore, our approach does not require a single complex dataset that contains all the modalities mentioned above. Rather it only requires multiple satellite-image paired data. While we only align three modalities in this paper, we present a general framework that can be used to create an embedding space with any number of modalities by using satellite images as the binding element. Our results show that, unlike traditional unimodal models, GeoBind is versatile and can reason about multiple modalities for a given satellite image input.
- [174] arXiv:2404.11742 [ pdf , ps , html , other ]
-
Title: Meta-Decomposition: Dynamic Segmentation Approach Selection in IoT-based Activity RecognitionSubjects: Artificial Intelligence (cs.AI)
Abstract: Internet of Things (IoT) devices generate heterogeneous data over time; and relying solely on individual data points is inadequate for accurate analysis.
Segmentation is a common preprocessing step in many IoT applications, including IoT-based activity recognition, aiming to address the limitations of individual events and streamline the process. However, this step introduces at least two families of uncontrollable biases. The first is caused by the changes made by the segmentation process on the initial problem space, such as dividing the input data into 60 seconds windows. The second category of biases results from the segmentation process itself, including the fixation of the segmentation method and its parameters.
To address these biases, we propose to redefine the segmentation problem as a special case of a decomposition problem, including three key components: a decomposer, resolutions, and a composer.
The inclusion of the composer task in the segmentation process facilitates an assessment of the relationship between the original problem and the problem after the segmentation. Therefore, It leads to an improvement in the evaluation process and, consequently, in the selection of the appropriate segmentation method.
Then, we formally introduce our novel meta-decomposition or learning-to-decompose approach. It reduces the segmentation biases by considering the segmentation as a hyperparameter to be optimized by the outer learning problem. Therefore, meta-decomposition improves the overall system performance by dynamically selecting the appropriate segmentation method without including the mentioned biases. Extensive experiments on four real-world datasets demonstrate the effectiveness of our proposal. - [175] arXiv:2404.11744 [ pdf , ps , html , other ]
-
Title: Incremental Bootstrapping and Classification of Structured Scenes in a Fuzzy OntologySubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC); Logic in Computer Science (cs.LO); Robotics (cs.RO)
Abstract: We foresee robots that bootstrap knowledge representations and use them for classifying relevant situations and making decisions based on future observations. Particularly for assistive robots, the bootstrapping mechanism might be supervised by humans who should not repeat a training phase several times and should be able to refine the taught representation. We consider robots that bootstrap structured representations to classify some intelligible categories. Such a structure should be incrementally bootstrapped, i.e., without invalidating the identified category models when a new additional category is considered. To tackle this scenario, we presented the Scene Identification and Tagging (SIT) algorithm, which bootstraps structured knowledge representation in a crisp OWL-DL ontology. Over time, SIT bootstraps a graph representing scenes, sub-scenes and similar scenes. Then, SIT can classify new scenes within the bootstrapped graph through logic-based reasoning. However, SIT has issues with sensory data because its crisp implementation is not robust to perception noises. This paper presents a reformulation of SIT within the fuzzy domain, which exploits a fuzzy DL ontology to overcome the robustness issues. By comparing the performances of fuzzy and crisp implementations of SIT, we show that fuzzy SIT is robust, preserves the properties of its crisp formulation, and enhances the bootstrapped representations. On the contrary, the fuzzy implementation of SIT leads to less intelligible knowledge representations than the one bootstrapped in the crisp domain.
- [176] arXiv:2404.11792 [ pdf , ps , html , other ]
-
Title: Enhancing Q&A with Domain-Specific Fine-Tuning and Iterative Reasoning: A Comparative StudyZooey Nguyen , Anthony Annunziata , Vinh Luong , Sang Dinh , Quynh Le , Anh Hai Ha , Chanh Le , Hong An Phan , Shruti Raghavan , Christopher NguyenComments: Fixed typo of OODA's score on harder-question set in Table 2Subjects: Artificial Intelligence (cs.AI)
Abstract: This paper investigates the impact of domain-specific model fine-tuning and of reasoning mechanisms on the performance of question-answering (Q&A) systems powered by large language models (LLMs) and Retrieval-Augmented Generation (RAG). Using the FinanceBench SEC financial filings dataset, we observe that, for RAG, combining a fine-tuned embedding model with a fine-tuned LLM achieves better accuracy than generic models, with relatively greater gains attributable to fine-tuned embedding models. Additionally, employing reasoning iterations on top of RAG delivers an even bigger jump in performance, enabling the Q&A systems to get closer to human-expert quality. We discuss the implications of such findings, propose a structured technical design space capturing major technical components of Q&A AI, and provide recommendations for making high-impact technical choices for such components. We plan to follow up on this work with actionable guides for AI teams and further investigations into the impact of domain-specific augmentation in RAG and into agentic AI capabilities such as advanced planning and reasoning.
- [177] arXiv:2404.11833 [ pdf , ps , html , other ]
-
Title: Planning with Language Models Through The Lens of EfficiencySubjects: Artificial Intelligence (cs.AI)
Abstract: We analyse the cost of using LLMs for planning and highlight that recent trends are profoundly uneconomical. We propose a significantly more efficient approach and argue for a responsible use of compute resources; urging research community to investigate LLM-based approaches that upholds efficiency.
- [178] arXiv:2404.11835 [ pdf , ps , html , other ]
-
Title: CAUS: A Dataset for Question Generation based on Human Cognition Leveraging Large Language ModelsComments: 8 pages, 4 figures and 3 tables. This work has been accepted for presentation at CogSci 2024 and is currently under revisionSubjects: Artificial Intelligence (cs.AI)
Abstract: We introduce the CAUS (Curious About Uncertain Scene) dataset, designed to enable Large Language Models, specifically GPT-4, to emulate human cognitive processes for resolving uncertainties. Leveraging this dataset, we investigate the potential of LLMs to engage in questioning effectively. Our approach involves providing scene descriptions embedded with uncertainties to stimulate the generation of reasoning and queries. The queries are then classified according to multi-dimensional criteria. All procedures are facilitated by a collaborative system involving both LLMs and human researchers. Our results demonstrate that GPT-4 can effectively generate pertinent questions and grasp their nuances, particularly when given appropriate context and instructions. The study suggests that incorporating human-like questioning into AI models improves their ability to manage uncertainties, paving the way for future advancements in Artificial Intelligence (AI).
- [179] arXiv:2404.11854 [ pdf , ps , html , other ]
-
Title: SGRU: A High-Performance Structured Gated Recurrent Unit for Traffic Flow PredictionComments: 7 pages, 6 figures, conferenceSubjects: Artificial Intelligence (cs.AI)
Abstract: Traffic flow prediction is an essential task in constructing smart cities and is a typical Multivariate Time Series (MTS) Problem. Recent research has abandoned Gated Recurrent Units (GRU) and utilized dilated convolutions or temporal slicing for feature extraction, and they have the following drawbacks: (1) Dilated convolutions fail to capture the features of adjacent time steps, resulting in the loss of crucial transitional data. (2) The connections within the same temporal slice are strong, while the connections between different temporal slices are too loose. In light of these limitations, we emphasize the importance of analyzing a complete time series repeatedly and the crucial role of GRU in MTS. Therefore, we propose SGRU: Structured Gated Recurrent Units, which involve structured GRU layers and non-linear units, along with multiple layers of time embedding to enhance the model's fitting performance. We evaluate our approach on four publicly available California traffic datasets: PeMS03, PeMS04, PeMS07, and PeMS08 for regression prediction. Experimental results demonstrate that our model outperforms baseline models with average improvements of 11.7%, 18.6%, 18.5%, and 12.0% respectively.
- [180] arXiv:2404.11875 [ pdf , ps , html , other ]
-
Title: Concept Induction using LLMs: a user experiment for assessmentSubjects: Artificial Intelligence (cs.AI)
Abstract: Explainable Artificial Intelligence (XAI) poses a significant challenge in providing transparent and understandable insights into complex AI models. Traditional post-hoc algorithms, while useful, often struggle to deliver interpretable explanations. Concept-based models offer a promising avenue by incorporating explicit representations of concepts to enhance interpretability. However, existing research on automatic concept discovery methods is often limited by lower-level concepts, costly human annotation requirements, and a restricted domain of background knowledge. In this study, we explore the potential of a Large Language Model (LLM), specifically GPT-4, by leveraging its domain knowledge and common-sense capability to generate high-level concepts that are meaningful as explanations for humans, for a specific setting of image classification. We use minimal textual object information available in the data via prompting to facilitate this process. To evaluate the output, we compare the concepts generated by the LLM with two other methods: concepts generated by humans and the ECII heuristic concept induction system. Since there is no established metric to determine the human understandability of concepts, we conducted a human study to assess the effectiveness of the LLM-generated concepts. Our findings indicate that while human-generated explanations remain superior, concepts derived from GPT-4 are more comprehensible to humans compared to those generated by ECII.
- [181] arXiv:2404.11891 [ pdf , ps , html , other ]
-
Title: Large Language Models Can Plan Your Travels Rigorously with Formal Verification ToolsComments: 31 pages, 3 figures, 4 tables, submitted to ACL RRSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Abstract: The recent advancements of Large Language Models (LLMs), with their abundant world knowledge and capabilities of tool-using and reasoning, fostered many LLM planning algorithms. However, LLMs have not shown to be able to accurately solve complex combinatorial optimization problems. In Xie et al. (2024), the authors proposed TravelPlanner, a U.S. domestic travel planning benchmark, and showed that LLMs themselves cannot make travel plans that satisfy user requirements with a best success rate of 0.6%. In this work, we propose a framework that enables LLMs to formally formulate and solve the travel planning problem as a satisfiability modulo theory (SMT) problem and use SMT solvers interactively and automatically solve the combinatorial search problem. The SMT solvers guarantee the satisfiable of input constraints and the LLMs can enable a language-based interaction with our framework. When the input constraints cannot be satisfiable, our LLM-based framework will interactively offer suggestions to users to modify their travel requirements via automatic reasoning using the SMT solvers. We evaluate our framework with TravelPlanner and achieve a success rate of 97%. We also create a separate dataset that contain international travel benchmarks and use both dataset to evaluate the effectiveness of our interactive planning framework when the initial user queries cannot be satisfied. Our framework could generate valid plans with an average success rate of 78.6% for our dataset and 85.0% for TravelPlanner according to diverse humans preferences.
- [182] arXiv:2404.11898 [ pdf , ps , other ]
-
Title: Enhancing Financial Inclusion and Regulatory Challenges: A Critical Analysis of Digital Banks and Alternative Lenders Through Digital Platforms, Machine Learning, and Large Language Models IntegrationComments: 17 pagesSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper explores the dual impact of digital banks and alternative lenders on financial inclusion and the regulatory challenges posed by their business models. It discusses the integration of digital platforms, machine learning (ML), and Large Language Models (LLMs) in enhancing financial services accessibility for underserved populations. Through a detailed analysis of operational frameworks and technological infrastructures, this research identifies key mechanisms that facilitate broader financial access and mitigate traditional barriers. Additionally, the paper addresses significant regulatory concerns involving data privacy, algorithmic bias, financial stability, and consumer protection. Employing a mixed-methods approach, which combines quantitative financial data analysis with qualitative insights from industry experts, this paper elucidates the complexities of leveraging digital technology to foster financial inclusivity. The findings underscore the necessity of evolving regulatory frameworks that harmonize innovation with comprehensive risk management. This paper concludes with policy recommendations for regulators, financial institutions, and technology providers, aiming to cultivate a more inclusive and stable financial ecosystem through prudent digital technology integration.
- [183] arXiv:2404.11907 [ pdf , ps , html , other ]
-
Title: Sampling-based Pareto Optimization for Chance-constrained Monotone Submodular ProblemsSubjects: Artificial Intelligence (cs.AI)
Abstract: Recently surrogate functions based on the tail inequalities were developed to evaluate the chance constraints in the context of evolutionary computation and several Pareto optimization algorithms using these surrogates were successfully applied in optimizing chance-constrained monotone submodular problems. However, the difference in performance between algorithms using the surrogates and those employing the direct sampling-based evaluation remains unclear. Within the paper, a sampling-based method is proposed to directly evaluate the chance constraint. Furthermore, to address the problems with more challenging settings, an enhanced GSEMO algorithm integrated with an adaptive sliding window, called ASW-GSEMO, is introduced. In the experiments, the ASW-GSEMO employing the sampling-based approach is tested on the chance-constrained version of the maximum coverage problem with different settings. Its results are compared with those from other algorithms using different surrogate functions. The experimental findings indicate that the ASW-GSEMO with the sampling-based evaluation approach outperforms other algorithms, highlighting that the performances of algorithms using different evaluation methods are comparable. Additionally, the behaviors of ASW-GSEMO are visualized to explain the distinctions between it and the algorithms utilizing the surrogate functions.
- [184] arXiv:2404.11924 [ pdf , ps , html , other ]
-
Title: Toward Short-Term Glucose Prediction Solely Based on CGM Time SeriesSubjects: Artificial Intelligence (cs.AI)
Abstract: The global diabetes epidemic highlights the importance of maintaining good glycemic control. Glucose prediction is a fundamental aspect of diabetes management, facilitating real-time decision-making. Recent research has introduced models focusing on long-term glucose trend prediction, which are unsuitable for real-time decision-making and result in delayed responses. Conversely, models designed to respond to immediate glucose level changes cannot analyze glucose variability comprehensively. Moreover, contemporary research generally integrates various physiological parameters (e.g. insulin doses, food intake, etc.), which inevitably raises data privacy concerns. To bridge such a research gap, we propose TimeGlu -- an end-to-end pipeline for short-term glucose prediction solely based on CGM time series data. We implement four baseline methods to conduct a comprehensive comparative analysis of the model's performance. Through extensive experiments on two contrasting datasets (CGM Glucose and Colas dataset), TimeGlu achieves state-of-the-art performance without the need for additional personal data from patients, providing effective guidance for real-world diabetic glucose management.
- [185] arXiv:2404.11962 [ pdf , ps , html , other ]
-
Title: \copyright Plug-in Authorization for Human Content Copyright Protection in Text-to-Image ModelComments: 20 pages, 6 figuresSubjects: Artificial Intelligence (cs.AI) ; Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: This paper addresses the contentious issue of copyright infringement in images generated by text-to-image models, sparking debates among AI developers, content creators, and legal entities. State-of-the-art models create high-quality content without crediting original creators, causing concern in the artistic community. To mitigate this, we propose the ©Plug-in Authorization framework, introducing three operations: addition, extraction, and combination. Addition involves training a ©plug-in for specific copyright, facilitating proper credit attribution. Extraction allows creators to reclaim copyright from infringing models, and combination enables users to merge different ©plug-ins. These operations act as permits, incentivizing fair use and providing flexibility in authorization. We present innovative approaches,"Reverse LoRA" for extraction and "EasyMerge" for seamless combination. Experiments in artist-style replication and cartoon IP recreation demonstrate ©plug-ins' effectiveness, offering a valuable solution for human copyright protection in the age of generative AIs.
- [186] arXiv:2404.11964 [ pdf , ps , html , other ]
-
Title: From Language Models to Practical Self-Improving Computer AgentsSubjects: Artificial Intelligence (cs.AI)
Abstract: We develop a simple and straightforward methodology to create AI computer agents that can carry out diverse computer tasks and self-improve by developing tools and augmentations to enable themselves to solve increasingly complex tasks. As large language models (LLMs) have been shown to benefit from non-parametric augmentations, a significant body of recent work has focused on developing software that augments LLMs with various capabilities. Rather than manually developing static software to augment LLMs through human engineering effort, we propose that an LLM agent can systematically generate software to augment itself. We show, through a few case studies, that a minimal querying loop with appropriate prompt engineering allows an LLM to generate and use various augmentations, freely extending its own capabilities to carry out real-world computer tasks. Starting with only terminal access, we prompt an LLM agent to augment itself with retrieval, internet search, web navigation, and text editor capabilities. The agent effectively uses these various tools to solve problems including automated software development and web-based tasks.
- [187] arXiv:2404.11973 [ pdf , ps , other ]
-
Title: Exploring the landscape of large language models: Foundations, techniques, and challengesSubjects: Artificial Intelligence (cs.AI)
Abstract: In this review paper, we delve into the realm of Large Language Models (LLMs), covering their foundational principles, diverse applications, and nuanced training processes. The article sheds light on the mechanics of in-context learning and a spectrum of fine-tuning approaches, with a special focus on methods that optimize efficiency in parameter usage. Additionally, it explores how LLMs can be more closely aligned with human preferences through innovative reinforcement learning frameworks and other novel methods that incorporate human feedback. The article also examines the emerging technique of retrieval augmented generation, integrating external knowledge into LLMs. The ethical dimensions of LLM deployment are discussed, underscoring the need for mindful and responsible application. Concluding with a perspective on future research trajectories, this review offers a succinct yet comprehensive overview of the current state and emerging trends in the evolving landscape of LLMs, serving as an insightful guide for both researchers and practitioners in artificial intelligence.
- [188] arXiv:2404.11988 [ pdf , ps , html , other ]
-
Title: The Emerging AI Divide in the United StatesSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY)
Abstract: The digital divide describes disparities in access to and usage of digital tooling between social and economic groups. Emerging generative artificial intelligence tools, which strongly affect productivity, could magnify the impact of these divides. However, the affordability, multi-modality, and multilingual capabilities of these tools could also make them more accessible to diverse users in comparison with previous forms of digital tooling. In this study, we characterize spatial differences in U.S. residents' knowledge of a new generative AI tool, ChatGPT, through an analysis of state- and county-level search query data. In the first six months after the tool's release, we observe the highest rates of users searching for ChatGPT in West Coast states and persistently low rates of search in Appalachian and Gulf states. Counties with the highest rates of search are relatively more urbanized and have proportionally more educated, more economically advantaged, and more Asian residents in comparison with other counties or with the U.S. average. In multilevel models adjusting for socioeconomic and demographic factors as well as industry makeup, education is the strongest positive predictor of rates of search for generative AI tooling. Although generative AI technologies may be novel, early differences in uptake appear to be following familiar paths of digital marginalization.
- [189] arXiv:2404.11996 [ pdf , ps , html , other ]
-
Title: DST-GTN: Dynamic Spatio-Temporal Graph Transformer Network for Traffic ForecastingSongtao Huang , Hongjin Song , Tianqi Jiang , Akbar Telikani , Jun Shen , Qingguo Zhou , Binbin Yong , Qiang WuSubjects: Artificial Intelligence (cs.AI)
Abstract: Accurate traffic forecasting is essential for effective urban planning and congestion management. Deep learning (DL) approaches have gained colossal success in traffic forecasting but still face challenges in capturing the intricacies of traffic dynamics. In this paper, we identify and address this challenges by emphasizing that spatial features are inherently dynamic and change over time. A novel in-depth feature representation, called Dynamic Spatio-Temporal (Dyn-ST) features, is introduced, which encapsulates spatial characteristics across varying times. Moreover, a Dynamic Spatio-Temporal Graph Transformer Network (DST-GTN) is proposed by capturing Dyn-ST features and other dynamic adjacency relations between intersections. The DST-GTN can model dynamic ST relationships between nodes accurately and refine the representation of global and local ST characteristics by adopting adaptive weights in low-pass and all-pass filters, enabling the extraction of Dyn-ST features from traffic time-series data. Through numerical experiments on public datasets, the DST-GTN achieves state-of-the-art performance for a range of traffic forecasting tasks and demonstrates enhanced stability.
- [190] arXiv:2404.12045 [ pdf , ps , html , other ]
-
Title: RAM: Towards an Ever-Improving Memory System by Learning from CommunicationsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: We introduce RAM, an innovative RAG-based framework with an ever-improving memory. Inspired by humans' pedagogical process, RAM utilizes recursively reasoning-based retrieval and experience reflections to continually update the memory and learn from users' communicative feedback, namely communicative learning. Extensive experiments with both simulated and real users demonstrate significant improvements over traditional RAG and self-knowledge methods, particularly excelling in handling false premise and multi-hop questions. Furthermore, RAM exhibits promising adaptability to various feedback and retrieval method chain types, showcasing its potential for advancing AI capabilities in dynamic knowledge acquisition and lifelong learning.
- [191] arXiv:2404.12076 [ pdf , ps , html , other ]
-
Title: Evolutionary Multi-Objective Optimisation for Fairness-Aware Self Adjusting Memory Classifiers in Data StreamsComments: This paper has been accepted by GECCO 2024Subjects: Artificial Intelligence (cs.AI) ; Neural and Evolutionary Computing (cs.NE)
Abstract: This paper introduces a novel approach, evolutionary multi-objective optimisation for fairness-aware self-adjusting memory classifiers, designed to enhance fairness in machine learning algorithms applied to data stream classification. With the growing concern over discrimination in algorithmic decision-making, particularly in dynamic data stream environments, there is a need for methods that ensure fair treatment of individuals across sensitive attributes like race or gender. The proposed approach addresses this challenge by integrating the strengths of the self-adjusting memory K-Nearest-Neighbour algorithm with evolutionary multi-objective optimisation. This combination allows the new approach to efficiently manage concept drift in streaming data and leverage the flexibility of evolutionary multi-objective optimisation to maximise accuracy and minimise discrimination simultaneously. We demonstrate the effectiveness of the proposed approach through extensive experiments on various datasets, comparing its performance against several baseline methods in terms of accuracy and fairness metrics. Our results show that the proposed approach maintains competitive accuracy and significantly reduces discrimination, highlighting its potential as a robust solution for fairness-aware data stream classification. Further analyses also confirm the effectiveness of the strategies to trigger evolutionary multi-objective optimisation and adapt classifiers in the proposed approach.
- [192] arXiv:2404.12090 [ pdf , ps , html , other ]
-
Title: X-Light: Cross-City Traffic Signal Control Using Transformer on Transformer as Meta Multi-Agent Reinforcement LearnerHaoyuan Jiang , Ziyue Li , Hua Wei , Xuantang Xiong , Jingqing Ruan , Jiaming Lu , Hangyu Mao , Rui ZhaoComments: Accepted by IJCAI 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: The effectiveness of traffic light control has been significantly improved by current reinforcement learning-based approaches via better cooperation among multiple traffic lights. However, a persisting issue remains: how to obtain a multi-agent traffic signal control algorithm with remarkable transferability across diverse cities? In this paper, we propose a Transformer on Transformer (TonT) model for cross-city meta multi-agent traffic signal control, named as X-Light: We input the full Markov Decision Process trajectories, and the Lower Transformer aggregates the states, actions, rewards among the target intersection and its neighbors within a city, and the Upper Transformer learns the general decision trajectories across different cities. This dual-level approach bolsters the model's robust generalization and transferability. Notably, when directly transferring to unseen scenarios, ours surpasses all baseline methods with +7.91% on average, and even +16.3% in some cases, yielding the best results.
- [193] arXiv:2404.12127 [ pdf , ps , html , other ]
-
Title: Personalized Forgetting Mechanism with Concept-Driven Knowledge TracingComments: under reviewSubjects: Artificial Intelligence (cs.AI)
Abstract: Knowledge Tracing (KT) aims to trace changes in students' knowledge states throughout their entire learning process by analyzing their historical learning data and predicting their future learning performance. Existing forgetting curve theory based knowledge tracing models only consider the general forgetting caused by time intervals, ignoring the individualization of students and the causal relationship of the forgetting process. To address these problems, we propose a Concept-driven Personalized Forgetting knowledge tracing model (CPF) which integrates hierarchical relationships between knowledge concepts and incorporates students' personalized cognitive abilities. First, we integrate the students' personalized capabilities into both the learning and forgetting processes to explicitly distinguish students' individual learning gains and forgetting rates according to their cognitive abilities. Second, we take into account the hierarchical relationships between knowledge points and design a precursor-successor knowledge concept matrix to simulate the causal relationship in the forgetting process, while also integrating the potential impact of forgetting prior knowledge points on subsequent ones. The proposed personalized forgetting mechanism can not only be applied to the learning of specifc knowledge concepts but also the life-long learning process. Extensive experimental results on three public datasets show that our CPF outperforms current forgetting curve theory based methods in predicting student performance, demonstrating CPF can better simulate changes in students' knowledge status through the personalized forgetting mechanism.
- [194] arXiv:2404.12134 [ pdf , ps , html , other ]
-
Title: Warped Time Series Anomaly DetectionCharlotte Lacoquelle , Xavier Pucel , Louise Travé-Massuyès (LAAS-DISCO, UT), Axel Reymonet , Benoît EnauxSubjects: Artificial Intelligence (cs.AI) ; Signal Processing (eess.SP)
Abstract: This paper addresses the problem of detecting time series outliers, focusing on systems with repetitive behavior, such as industrial robots operating on production lines.Notable challenges arise from the fact that a task performed multiple times may exhibit different duration in each repetition and that the time series reported by the sensors are irregularly sampled because of data gaps. The anomaly detection approach presented in this paper consists of three stages.The first stage identifies the repetitive cycles in the lengthy time series and segments them into individual time series corresponding to one task cycle, while accounting for possible temporal distortions.The second stage computes a prototype for the cycles using a GPU-based barycenter algorithm, specifically tailored for very large time series.The third stage uses the prototype to detect abnormal cycles by computing an anomaly score for each cycle.The overall approach, named WarpEd Time Series ANomaly Detection (WETSAND), makes use of the Dynamic Time Warping algorithm and its variants because they are suited to the distorted nature of the time series.The experiments show that \wetsand scales to large signals, computes human-friendly prototypes, works with very little data, and outperforms some general purpose anomaly detection approaches such as autoencoders.
- [195] arXiv:2404.12138 [ pdf , ps , html , other ]
-
Title: Character is Destiny: Can Large Language Models Simulate Persona-Driven Decisions in Role-Playing?Rui Xu , Xintao Wang , Jiangjie Chen , Siyu Yuan , Xinfeng Yuan , Jiaqing Liang , Zulong Chen , Xiaoqing Dong , Yanghua XiaoSubjects: Artificial Intelligence (cs.AI)
Abstract: Can Large Language Models substitute humans in making important decisions? Recent research has unveiled the potential of LLMs to role-play assigned personas, mimicking their knowledge and linguistic habits. However, imitative decision-making requires a more nuanced understanding of personas. In this paper, we benchmark the ability of LLMs in persona-driven decision-making. Specifically, we investigate whether LLMs can predict characters' decisions provided with the preceding stories in high-quality novels. Leveraging character analyses written by literary experts, we construct a dataset LIFECHOICE comprising 1,401 character decision points from 395 books. Then, we conduct comprehensive experiments on LIFECHOICE, with various LLMs and methods for LLM role-playing. The results demonstrate that state-of-the-art LLMs exhibit promising capabilities in this task, yet there is substantial room for improvement. Hence, we further propose the CHARMAP method, which achieves a 6.01% increase in accuracy via persona-based memory retrieval. We will make our datasets and code publicly available.
- [196] arXiv:2404.12143 [ pdf , ps , html , other ]
-
Title: The Neutrality Fallacy: When Algorithmic Fairness Interventions are (Not) Positive ActionJournal-ref: 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24)Subjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY)
Abstract: Various metrics and interventions have been developed to identify and mitigate unfair outputs of machine learning systems. While individuals and organizations have an obligation to avoid discrimination, the use of fairness-aware machine learning interventions has also been described as amounting to 'algorithmic positive action' under European Union (EU) non-discrimination law. As the Court of Justice of the European Union has been strict when it comes to assessing the lawfulness of positive action, this would impose a significant legal burden on those wishing to implement fair-ml interventions. In this paper, we propose that algorithmic fairness interventions often should be interpreted as a means to prevent discrimination, rather than a measure of positive action. Specifically, we suggest that this category mistake can often be attributed to neutrality fallacies: faulty assumptions regarding the neutrality of fairness-aware algorithmic decision-making. Our findings raise the question of whether a negative obligation to refrain from discrimination is sufficient in the context of algorithmic decision-making. Consequently, we suggest moving away from a duty to 'not do harm' towards a positive obligation to actively 'do no harm' as a more adequate framework for algorithmic decision-making and fair ml-interventions.
- [197] arXiv:2404.12149 [ pdf , ps , html , other ]
-
Title: AccidentBlip2: Accident Detection With Multi-View MotionBlip2Yihua Shao , Hongyi Cai , Xinwei Long , Weiyi Lang , Zhe Wang , Haoran Wu , Yan Wang , Jiayi Yin , Yang Yang , Yisheng Lv , Zhen LeiSubjects: Artificial Intelligence (cs.AI)
Abstract: Intelligent vehicles have demonstrated excellent capabilities in many transportation scenarios. The inference capabilities of neural networks using cameras limit the accuracy of accident detection in complex transportation systems. This paper presents AccidentBlip2, a pure vision-based multi-modal large model Blip2 for accident detection. Our method first processes the multi-view images through ViT-14g and sends the multi-view features into the cross-attention layer of Q-Former. Different from Blip2's Q-Former, our Motion Q-Former extends the self-attention layer with the temporal-attention layer. In the inference process, the queries generated from previous frames are input into Motion Q-Former to aggregate temporal information. Queries are updated with an auto-regressive strategy and are sent to a MLP to detect whether there is an accident in the surrounding environment. Our AccidentBlip2 can be extended to a multi-vehicle cooperative system by deploying Motion Q-Former on each vehicle and simultaneously fusing the generated queries into the MLP for auto-regressive inference. Our approach outperforms existing video large language models in detection accuracy in both single-vehicle and multi-vehicle systems.
- [198] arXiv:2404.12185 [ pdf , ps , html , other ]
-
Title: An Adaptive Metaheuristic Framework for Changing EnvironmentsComments: Accepted in 2024 IEEE Congress on Evolutionary Computation (CEC)Subjects: Artificial Intelligence (cs.AI)
Abstract: The rapidly changing landscapes of modern optimization problems require algorithms that can be adapted in real-time. This paper introduces an Adaptive Metaheuristic Framework (AMF) designed for dynamic environments. It is capable of intelligently adapting to changes in the problem parameters. The AMF combines a dynamic representation of problems, a real-time sensing system, and adaptive techniques to navigate continuously changing optimization environments. Through a simulated dynamic optimization problem, the AMF's capability is demonstrated to detect environmental changes and proactively adjust its search strategy. This framework utilizes a differential evolution algorithm that is improved with an adaptation module that adjusts solutions in response to detected changes. The capability of the AMF to adjust is tested through a series of iterations, demonstrating its resilience and robustness in sustaining solution quality despite the problem's development. The effectiveness of AMF is demonstrated through a series of simulations on a dynamic optimization problem. Robustness and agility characterize the algorithm's performance, as evidenced by the presented fitness evolution and solution path visualizations. The findings show that AMF is a practical solution to dynamic optimization and a major step forward in the creation of algorithms that can handle the unpredictability of real-world problems.
- [199] arXiv:2404.12228 [ pdf , ps , html , other ]
-
Title: Relationship Discovery for Drug RecommendationSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Medication recommendation systems are designed to deliver personalized drug suggestions that are closely aligned with individual patient needs. Previous studies have primarily concentrated on developing medication embeddings, achieving significant progress. Nonetheless, these approaches often fall short in accurately reflecting individual patient profiles, mainly due to challenges in distinguishing between various patient conditions and the inability to establish precise correlations between specific conditions and appropriate medications. In response to these issues, we introduce DisMed, a model that focuses on patient conditions to enhance personalization. DisMed employs causal inference to discern clear, quantifiable causal links. It then examines patient conditions in depth, recognizing and adapting to the evolving nuances of these conditions, and mapping them directly to corresponding medications. Additionally, DisMed leverages data from multiple patient visits to propose combinations of medications. Comprehensive testing on real-world datasets demonstrates that DisMed not only improves the customization of patient profiles but also surpasses leading models in both precision and safety.
- [200] arXiv:2404.12240 [ pdf , ps , html , other ]
-
Title: A Time-Inhomogeneous Markov Model for Resource Availability under Sparse ObservationsComments: 11 pages, long version of a paper published at 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL 2018)Journal-ref: Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (pp. 460-463) 2018Subjects: Artificial Intelligence (cs.AI)
Abstract: Accurate spatio-temporal information about the current situation is crucial for smart city applications such as modern routing algorithms. Often, this information describes the state of stationary resources, e.g. the availability of parking bays, charging stations or the amount of people waiting for a vehicle to pick them up near a given location. To exploit this kind of information, predicting future states of the monitored resources is often mandatory because a resource might change its state within the time until it is needed. To train an accurate predictive model, it is often not possible to obtain a continuous time series on the state of the resource. For example, the information might be collected from traveling agents visiting the resource with an irregular frequency. Thus, it is necessary to develop methods which work on sparse observations for training and prediction. In this paper, we propose time-inhomogeneous discrete Markov models to allow accurate prediction even when the frequency of observation is very rare. Our new model is able to blend recent observations with historic data and also provide useful probabilistic estimates for future states. Since resources availability in a city is typically time-dependent, our Markov model is time-inhomogeneous and cyclic within a predefined time interval. To train our model, we propose a modified Baum-Welch algorithm. Evaluations on real-world datasets of parking bay availability show that our new method indeed yields good results compared to methods being trained on complete data and non-cyclic variants.
- [201] arXiv:2404.12273 [ pdf , ps , html , other ]
-
Title: FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective WisdomComments: In ProgressSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Federated Learning (FL) has emerged as a promising solution for collaborative training of large language models (LLMs). However, the integration of LLMs into FL introduces new challenges, particularly concerning the evaluation of LLMs. Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers, thereby failing to accurately reflect the performance of LLMs on generative tasks. Meanwhile, although automatic evaluation methods that leverage advanced LLMs present potential, they face critical risks of data leakage due to the need to transmit data to external servers and suboptimal performance on downstream tasks due to the lack of domain knowledge. To address these issues, we propose a Federated Evaluation framework of Large Language Models, named FedEval-LLM, that provides reliable performance measurements of LLMs on downstream tasks without the reliance on labeled test sets and external tools, thus ensuring strong privacy-preserving capability. FedEval-LLM leverages a consortium of personalized LLMs from participants as referees to provide domain knowledge and collective evaluation capability, thus aligning to the respective downstream tasks and mitigating uncertainties and biases associated with a single referee. Experimental results demonstrate a significant improvement in the evaluation capability of personalized evaluation models on downstream tasks. When applied to FL, these evaluation models exhibit strong agreement with human preference and RougeL-score on meticulously curated test sets. FedEval-LLM effectively overcomes the limitations of traditional metrics and the reliance on external services, making it a promising framework for the evaluation of LLMs within collaborative training scenarios.
- [202] arXiv:2404.12278 [ pdf , ps , html , other ]
-
Title: DF-DM: A foundational process model for multimodal data fusion in the artificial intelligence eraDavid Restrepo , Chenwei Wu , Constanza Vásquez-Venegas , Luis Filipe Nakayama , Leo Anthony Celi , Diego M LópezComments: 6 figures, 5 tablesSubjects: Artificial Intelligence (cs.AI)
Abstract: In the big data era, integrating diverse data modalities poses significant challenges, particularly in complex fields like healthcare. This paper introduces a new process model for multimodal Data Fusion for Data Mining, integrating embeddings and the Cross-Industry Standard Process for Data Mining with the existing Data Fusion Information Group model. Our model aims to decrease computational costs, complexity, and bias while improving efficiency and reliability. We also propose "disentangled dense fusion", a novel embedding fusion method designed to optimize mutual information and facilitate dense inter-modality feature interaction, thereby minimizing redundant information.
We demonstrate the model's efficacy through three use cases: predicting diabetic retinopathy using retinal images and patient metadata, domestic violence prediction employing satellite imagery, internet, and census data, and identifying clinical and demographic features from radiography images and clinical notes. The model achieved a Macro F1 score of 0.92 in diabetic retinopathy prediction, an R-squared of 0.854 and sMAPE of 24.868 in domestic violence prediction, and a macro AUC of 0.92 and 0.99 for disease prediction and sex classification, respectively, in radiological analysis.
These results underscore the Data Fusion for Data Mining model's potential to significantly impact multimodal data processing, promoting its adoption in diverse, resource-constrained settings. - [203] arXiv:2404.12349 [ pdf , ps , html , other ]
-
Title: Evaluating AI for Law: Bridging the Gap with Open-Source SolutionsSubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC)
Abstract: This study evaluates the performance of general-purpose AI, like ChatGPT, in legal question-answering tasks, highlighting significant risks to legal professionals and clients. It suggests leveraging foundational models enhanced by domain-specific knowledge to overcome these issues. The paper advocates for creating open-source legal AI systems to improve accuracy, transparency, and narrative diversity, addressing general AI's shortcomings in legal contexts.
- [204] arXiv:2404.12361 [ pdf , ps , html , other ]
-
Title: Learning the Domain Specific Inverse NUFFT for Accelerated Spiral MRI using Diffusion ModelsSubjects: Artificial Intelligence (cs.AI) ; Medical Physics (physics.med-ph)
Abstract: Deep learning methods for accelerated MRI achieve state-of-the-art results but largely ignore additional speedups possible with noncartesian sampling trajectories. To address this gap, we created a generative diffusion model-based reconstruction algorithm for multi-coil highly undersampled spiral MRI. This model uses conditioning during training as well as frequency-based guidance to ensure consistency between images and measurements. Evaluated on retrospective data, we show high quality (structural similarity > 0.87) in reconstructed images with ultrafast scan times (0.02 seconds for a 2D image). We use this algorithm to identify a set of optimal variable-density spiral trajectories and show large improvements in image quality compared to conventional reconstruction using the non-uniform fast Fourier transform. By combining efficient spiral sampling trajectories, multicoil imaging, and deep learning reconstruction, these methods could enable the extremely high acceleration factors needed for real-time 3D imaging.
- [205] arXiv:2404.12458 [ pdf , ps , other ]
-
Title: The collective use and evaluation of generative AI tools in digital humanities research: Survey-based resultsSubjects: Artificial Intelligence (cs.AI)
Abstract: The advent of generative artificial intelligence (GenAI) technologies has revolutionized research, with significant implications for Digital Humanities (DH), a field inherently intertwined with technological progress. This article investigates how digital humanities scholars adopt, practice, as well as critically evaluate, GenAI technologies such as ChatGPT in the research process. Drawing on 76 responses collected from an international survey study, we explored digital humanities scholars' rationale for GenAI adoption in research, identified specific use cases and practices of using GenAI to support various DH research tasks, and analyzed scholars' collective perceptions of GenAI's benefits, risks, and impact on DH research. The survey results suggest that DH research communities hold divisive sentiments towards the value of GenAI in DH scholarship, whereas the actual usage diversifies among individuals and across research tasks. Our survey-based analysis has the potential to serve as a basis for further empirical research on the impact of GenAI on the evolution of DH scholarship.
- [206] arXiv:2404.12460 [ pdf , ps , html , other ]
-
Title: NLP-enabled trajectory map-matching in urban road networks using transformer sequence-to-sequence modelComments: 14 pages, 10 figuresSubjects: Artificial Intelligence (cs.AI) ; Computational Engineering, Finance, and Science (cs.CE)
Abstract: Large-scale geolocation telematics data acquired from connected vehicles has the potential to significantly enhance mobility infrastructures and operational systems within smart cities. To effectively utilize this data, it is essential to accurately match the geolocation data to the road segments. However, this matching is often not trivial due to the low sampling rate and errors exacerbated by multipath effects in urban environments. Traditionally, statistical modeling techniques such as Hidden-Markov models incorporating domain knowledge into the matching process have been extensively used for map-matching tasks. However, rule-based map-matching tasks are noise-sensitive and inefficient in processing large-scale trajectory data. Deep learning techniques directly learn the relationship between observed data and road networks from the data, often without the need for hand-crafted rules or domain knowledge. This renders them an efficient approach for map-matching large-scale datasets and makes them more robust to the noise. This paper introduces a sequence-to-sequence deep-learning model, specifically the transformer-based encoder-decoder model, to perform as a surrogate for map-matching algorithms. The encoder-decoder architecture initially encodes the series of noisy GPS points into a representation that automatically captures autoregressive behavior and spatial correlations between GPS points. Subsequently, the decoder associates data points with the road network features and thus transforms these representations into a sequence of road segments. The model is trained and evaluated using GPS traces collected in Manhattan, New York. Achieving an accuracy of 76%, transformer-based encoder-decoder models extensively employed in natural language processing presented a promising performance for translating noisy GPS data to the navigated routes in urban road networks.
- [207] arXiv:2404.12520 [ pdf , ps , html , other ]
-
Title: Centralized vs. Decentralized Multi-Agent Reinforcement Learning for Enhanced Control of Electric Vehicle Charging NetworksComments: 12 pages, 9 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: The widespread adoption of electric vehicles (EVs) poses several challenges to power distribution networks and smart grid infrastructure due to the possibility of significantly increasing electricity demands, especially during peak hours. Furthermore, when EVs participate in demand-side management programs, charging expenses can be reduced by using optimal charging control policies that fully utilize real-time pricing schemes. However, devising optimal charging methods and control strategies for EVs is challenging due to various stochastic and uncertain environmental factors. Currently, most EV charging controllers operate based on a centralized model. In this paper, we introduce a novel approach for distributed and cooperative charging strategy using a Multi-Agent Reinforcement Learning (MARL) framework. Our method is built upon the Deep Deterministic Policy Gradient (DDPG) algorithm for a group of EVs in a residential community, where all EVs are connected to a shared transformer. This method, referred to as CTDE-DDPG, adopts a Centralized Training Decentralized Execution (CTDE) approach to establish cooperation between agents during the training phase, while ensuring a distributed and privacy-preserving operation during execution. We theoretically examine the performance of centralized and decentralized critics for the DDPG-based MARL implementation and demonstrate their trade-offs. Furthermore, we numerically explore the efficiency, scalability, and performance of centralized and decentralized critics. Our theoretical and numerical results indicate that, despite higher policy gradient variances and training complexity, the CTDE-DDPG framework significantly improves charging efficiency by reducing total variation by approximately %36 and charging cost by around %9.1 on average...
- [208] arXiv:2404.12534 [ pdf , ps , html , other ]
-
Title: Towards Large Language Models as Copilots for Theorem Proving in LeanComments: All code open-sourced at this https URLSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Machine Learning (stat.ML)
Abstract: Theorem proving is an important challenge for large language models (LLMs), as formal proofs can be checked rigorously by proof assistants such as Lean, leaving no room for hallucination. Existing LLM-based provers try to prove theorems in a fully autonomous mode without human intervention. In this mode, they struggle with novel and challenging theorems, for which human insights may be critical. In this paper, we explore LLMs as copilots that assist humans in proving theorems. We introduce Lean Copilot, a framework for running LLM inference in Lean. It enables programmers to build various LLM-based proof automation tools that integrate seamlessly into the workflow of Lean users. Using Lean Copilot, we build tools for suggesting proof steps (tactic suggestion), completing intermediate proof goals (proof search), and selecting relevant premises (premise selection) using LLMs. Users can use our pretrained models or bring their own ones that run either locally (with or without GPUs) or on the cloud. Experimental results demonstrate the effectiveness of our method in assisting humans and automating theorem proving process compared to existing rule-based proof automation in Lean. We open source all codes under a permissive MIT license to facilitate further research.
- [209] arXiv:2404.12548 [ pdf , ps , html , other ]
-
Title: RetailOpt: An Opt-In, Easy-to-Deploy Trajectory Estimation System Leveraging Smartphone Motion Data and Retail Facility InformationSubjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO)
Abstract: We present RetailOpt, a novel opt-in, easy-to-deploy system for tracking customer movements in indoor retail environments. The system utilizes information presently accessible to customers through smartphones and retail apps: motion data, store map, and purchase records. The approach eliminates the need for additional hardware installations/maintenance and ensures customers maintain full control of their data. Specifically, RetailOpt first employs inertial navigation to recover relative trajectories from smartphone motion data. The store map and purchase records are then cross-referenced to identify a list of visited shelves, providing anchors to localize the relative trajectories in a store through continuous and discrete optimization. We demonstrate the effectiveness of our system through systematic experiments in five diverse environments. The proposed system, if successful, would produce accurate customer movement data, essential for a broad range of retail applications, including customer behavior analysis and in-store navigation. The potential application could also extend to other domains such as entertainment and assistive technologies.
- [210] arXiv:2404.12587 [ pdf , ps , html , other ]
-
Title: Reinforcement Learning Approach for Integrating Compressed Contexts into Knowledge GraphsComments: This paper has been accepted by the 2024 International Conference on Machine Learning and Neural Networks (MLNN 2024)Subjects: Artificial Intelligence (cs.AI)
Abstract: The widespread use of knowledge graphs in various fields has brought about a challenge in effectively integrating and updating information within them. When it comes to incorporating contexts, conventional methods often rely on rules or basic machine learning models, which may not fully grasp the complexity and fluidity of context information. This research suggests an approach based on reinforcement learning (RL), specifically utilizing Deep Q Networks (DQN) to enhance the process of integrating contexts into knowledge graphs. By considering the state of the knowledge graph as environment states defining actions as operations for integrating contexts and using a reward function to gauge the improvement in knowledge graph quality post-integration, this method aims to automatically develop strategies for optimal context integration. Our DQN model utilizes networks as function approximators, continually updating Q values to estimate the action value function, thus enabling effective integration of intricate and dynamic context information. Initial experimental findings show that our RL method outperforms techniques in achieving precise context integration across various standard knowledge graph datasets, highlighting the potential and effectiveness of reinforcement learning in enhancing and managing knowledge graphs.
- [211] arXiv:2404.12605 [ pdf , ps , html , other ]
-
Title: GluMarker: A Novel Predictive Modeling of Glycemic Control Through Digital BiomarkersSubjects: Artificial Intelligence (cs.AI)
Abstract: The escalating prevalence of diabetes globally underscores the need for diabetes management. Recent research highlights the growing focus on digital biomarkers in diabetes management, with innovations in computational frameworks and noninvasive monitoring techniques using personalized glucose metrics. However, they predominantly focus on insulin dosing and specific glucose values, or with limited attention given to overall glycemic control. This leaves a gap in expanding the scope of digital biomarkers for overall glycemic control in diabetes management. To address such a research gap, we propose GluMarker -- an end-to-end framework for modeling digital biomarkers using broader factors sources to predict glycemic control. Through the assessment and refinement of various machine learning baselines, GluMarker achieves state-of-the-art on Anderson's dataset in predicting next-day glycemic control. Moreover, our research identifies key digital biomarkers for the next day's glycemic control prediction. These identified biomarkers are instrumental in illuminating the daily factors that influence glycemic management, offering vital insights for diabetes care.
- [212] arXiv:2404.12626 [ pdf , ps , html , other ]
-
Title: Grasper: A Generalist Pursuer for Pursuit-Evasion ProblemsPengdeng Li , Shuxin Li , Xinrun Wang , Jakub Cerny , Youzhi Zhang , Stephen McAleer , Hau Chan , Bo AnComments: To appear in the 23rd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2024)Subjects: Artificial Intelligence (cs.AI) ; Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Abstract: Pursuit-evasion games (PEGs) model interactions between a team of pursuers and an evader in graph-based environments such as urban street networks. Recent advancements have demonstrated the effectiveness of the pre-training and fine-tuning paradigm in PSRO to improve scalability in solving large-scale PEGs. However, these methods primarily focus on specific PEGs with fixed initial conditions that may vary substantially in real-world scenarios, which significantly hinders the applicability of the traditional methods. To address this issue, we introduce Grasper, a GeneRAlist purSuer for Pursuit-Evasion pRoblems, capable of efficiently generating pursuer policies tailored to specific PEGs. Our contributions are threefold: First, we present a novel architecture that offers high-quality solutions for diverse PEGs, comprising critical components such as (i) a graph neural network (GNN) to encode PEGs into hidden vectors, and (ii) a hypernetwork to generate pursuer policies based on these hidden vectors. As a second contribution, we develop an efficient three-stage training method involving (i) a pre-pretraining stage for learning robust PEG representations through self-supervised graph learning techniques like GraphMAE, (ii) a pre-training stage utilizing heuristic-guided multi-task pre-training (HMP) where heuristic-derived reference policies (e.g., through Dijkstra's algorithm) regularize pursuer policies, and (iii) a fine-tuning stage that employs PSRO to generate pursuer policies on designated PEGs. Finally, we perform extensive experiments on synthetic and real-world maps, showcasing Grasper's significant superiority over baselines in terms of solution quality and generalizability. We demonstrate that Grasper provides a versatile approach for solving pursuit-evasion problems across a broad range of scenarios, enabling practical deployment in real-world situations.
- [213] arXiv:2404.12633 [ pdf , ps , html , other ]
-
Title: FlagVNE: A Flexible and Generalizable Reinforcement Learning Framework for Network Resource AllocationComments: Accepted by IJCAI 2024Subjects: Artificial Intelligence (cs.AI) ; Networking and Internet Architecture (cs.NI)
Abstract: Virtual network embedding (VNE) is an essential resource allocation task in network virtualization, aiming to map virtual network requests (VNRs) onto physical infrastructure. Reinforcement learning (RL) has recently emerged as a promising solution to this problem. However, existing RL-based VNE methods are limited by the unidirectional action design and one-size-fits-all training strategy, resulting in restricted searchability and generalizability. In this paper, we propose a FLexible And Generalizable RL framework for VNE, named FlagVNE. Specifically, we design a bidirectional action-based Markov decision process model that enables the joint selection of virtual and physical nodes, thus improving the exploration flexibility of solution space. To tackle the expansive and dynamic action space, we design a hierarchical decoder to generate adaptive action probability distributions and ensure high training efficiency. Furthermore, to overcome the generalization issue for varying VNR sizes, we propose a meta-RL-based training method with a curriculum scheduling strategy, facilitating specialized policy training for each VNR size. Finally, extensive experimental results show the effectiveness of FlagVNE across multiple key metrics. Our code is available at GitHub ( this https URL ).
- [214] arXiv:2404.12638 [ pdf , ps , html , other ]
-
Title: Learning to Cut via Hierarchical Sequence/Set Model for Efficient Mixed-Integer ProgrammingJie Wang , Zhihai Wang , Xijun Li , Yufei Kuang , Zhihao Shi , Fangzhou Zhu , Mingxuan Yuan , Jia Zeng , Yongdong Zhang , Feng WuComments: arXiv admin note: substantial text overlap with arXiv:2302.00244Subjects: Artificial Intelligence (cs.AI)
Abstract: Cutting planes (cuts) play an important role in solving mixed-integer linear programs (MILPs), which formulate many important real-world applications. Cut selection heavily depends on (P1) which cuts to prefer and (P2) how many cuts to select. Although modern MILP solvers tackle (P1)-(P2) by human-designed heuristics, machine learning carries the potential to learn more effective heuristics. However, many existing learning-based methods learn which cuts to prefer, neglecting the importance of learning how many cuts to select. Moreover, we observe that (P3) what order of selected cuts to prefer significantly impacts the efficiency of MILP solvers as well. To address these challenges, we propose a novel hierarchical sequence/set model (HEM) to learn cut selection policies. Specifically, HEM is a bi-level model: (1) a higher-level module that learns how many cuts to select, (2) and a lower-level module -- that formulates the cut selection as a sequence/set to sequence learning problem -- to learn policies selecting an ordered subset with the cardinality determined by the higher-level module. To the best of our knowledge, HEM is the first data-driven methodology that well tackles (P1)-(P3) simultaneously. Experiments demonstrate that HEM significantly improves the efficiency of solving MILPs on eleven challenging MILP benchmarks, including two Huawei's real problems.
- [215] arXiv:2404.12653 [ pdf , ps , html , other ]
-
Title: How Real Is Real? A Human Evaluation Framework for Unrestricted Adversarial ExamplesDren Fazlija , Arkadij Orlov , Johanna Schrader , Monty-Maximilian Zühlke , Michael Rohs , Daniel KudenkoComments: 3 pages, 3 figures, AAAI 2024 Spring Symposium on User-Aligned Assessment of Adaptive AI SystemsSubjects: Artificial Intelligence (cs.AI)
Abstract: With an ever-increasing reliance on machine learning (ML) models in the real world, adversarial examples threaten the safety of AI-based systems such as autonomous vehicles. In the image domain, they represent maliciously perturbed data points that look benign to humans (i.e., the image modification is not noticeable) but greatly mislead state-of-the-art ML models. Previously, researchers ensured the imperceptibility of their altered data points by restricting perturbations via $\ell_p$ norms. However, recent publications claim that creating natural-looking adversarial examples without such restrictions is also possible. With much more freedom to instill malicious information into data, these unrestricted adversarial examples can potentially overcome traditional defense strategies as they are not constrained by the limitations or patterns these defenses typically recognize and mitigate. This allows attackers to operate outside of expected threat models. However, surveying existing image-based methods, we noticed a need for more human evaluations of the proposed image modifications. Based on existing human-assessment frameworks for image generation quality, we propose SCOOTER - an evaluation framework for unrestricted image-based attacks. It provides researchers with guidelines for conducting statistically significant human experiments, standardized questions, and a ready-to-use implementation. We propose a framework that allows researchers to analyze how imperceptible their unrestricted attacks truly are.
- [216] arXiv:2404.12691 [ pdf , ps , html , other ]
-
Title: Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?Shayne Longpre , Robert Mahari , Naana Obeng-Marnu , William Brannon , Tobin South , Katy Gero , Sandy Pentland , Jad KabbaraComments: 9 pages, 2 tablesSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY)
Abstract: New capabilities in foundation models are owed in large part to massive, widely-sourced, and under-documented training data collections. Existing practices in data collection have led to challenges in documenting data transparency, tracing authenticity, verifying consent, privacy, representation, bias, copyright infringement, and the overall development of ethical and trustworthy foundation models. In response, regulation is emphasizing the need for training data transparency to understand foundation models' limitations. Based on a large-scale analysis of the foundation model training data landscape and existing solutions, we identify the missing infrastructure to facilitate responsible foundation model development practices. We examine the current shortcomings of common tools for tracing data authenticity, consent, and documentation, and outline how policymakers, developers, and data creators can facilitate responsible foundation model development by adopting universal data provenance standards.
- [217] arXiv:2404.12704 [ pdf , ps , html , other ]
-
Title: A Clean-graph Backdoor Attack against Graph Convolutional Networks with Poisoned Label OnlySubjects: Artificial Intelligence (cs.AI)
Abstract: Graph Convolutional Networks (GCNs) have shown excellent performance in dealing with various graph structures such as node classification, graph classification and other tasks. However,recent studies have shown that GCNs are vulnerable to a novel threat known as backdoor attacks. However, all existing backdoor attacks in the graph domain require modifying the training samples to accomplish the backdoor injection, which may not be practical in many realistic scenarios where adversaries have no access to modify the training samples and may leads to the backdoor attack being detected easily. In order to explore the backdoor vulnerability of GCNs and create a more practical and stealthy backdoor attack method, this paper proposes a clean-graph backdoor attack against GCNs (CBAG) in the node classification task,which only poisons the training labels without any modification to the training samples, revealing that GCNs have this security vulnerability. Specifically, CBAG designs a new trigger exploration method to find important feature dimensions as the trigger patterns to improve the attack performance. By poisoning the training labels, a hidden backdoor is injected into the GCNs model. Experimental results show that our clean graph backdoor can achieve 99% attack success rate while maintaining the functionality of the GCNs model on benign samples.
- [218] arXiv:2404.12760 [ pdf , ps , html , other ]
-
Title: Food Development through Co-creation with AI: bread with a "taste of love"Comments: Accepted to GenAICHI: CHI 2024 Workshop on Generative AI and HCISubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC)
Abstract: This study explores a new method in food development by utilizing AI including generative AI, aiming to craft products that delight the senses and resonate with consumers' emotions. The food ingredient recommendation approach used in this study can be considered as a form of multimodal generation in a broad sense, as it takes text as input and outputs food ingredient candidates. This Study focused on producing "Romance Bread," a collection of breads infused with flavors that reflect the nuances of a romantic Japanese television program. We analyzed conversations from TV programs and lyrics from songs featuring fruits and sweets to recommend ingredients that express romantic feelings. Based on these recommendations, the bread developers then considered the flavoring of the bread and developed new bread varieties. The research included a tasting evaluation involving 31 participants and interviews with the product developers. Findings indicate a notable correlation between tastes generated by AI and human preferences. This study validates the concept of using AI in food innovation and highlights the broad potential for developing unique consumer experiences that focus on emotional engagement through AI and human collaboration.
- [219] arXiv:2404.12762 [ pdf , ps , html , other ]
-
Title: How should AI decisions be explained? Requirements for Explanations from the Perspective of European LawComments: to be submittedSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY)
Abstract: This paper investigates the relationship between law and eXplainable Artificial Intelligence (XAI). While there is much discussion about the AI Act, for which the trilogue of the European Parliament, Council and Commission recently concluded, other areas of law seem underexplored. This paper focuses on European (and in part German) law, although with international concepts and regulations such as fiduciary plausibility checks, the General Data Protection Regulation (GDPR), and product safety and liability. Based on XAI-taxonomies, requirements for XAI-methods are derived from each of the legal bases, resulting in the conclusion that each legal basis requires different XAI properties and that the current state of the art does not fulfill these to full satisfaction, especially regarding the correctness (sometimes called fidelity) and confidence estimates of XAI-methods.
- [220] arXiv:2404.12926 [ pdf , ps , html , other ]
-
Title: MM-PhyRLHF: Reinforcement Learning Framework for Multimodal Physics Question-AnsweringAvinash Anand , Janak Kapuriya , Chhavi Kirtani , Apoorv Singh , Jay Saraf , Naman Lal , Jatin Kumar , Adarsh Raj Shivam , Astha Verma , Rajiv Ratn Shah , Roger ZimmermannSubjects: Artificial Intelligence (cs.AI)
Abstract: Recent advancements in LLMs have shown their significant potential in tasks like text summarization and generation. Yet, they often encounter difficulty while solving complex physics problems that require arithmetic calculation and a good understanding of concepts. Moreover, many physics problems include images that contain important details required to understand the problem's context. We propose an LMM-based chatbot to answer multimodal physics MCQs. For domain adaptation, we utilize the MM-PhyQA dataset comprising Indian high school-level multimodal physics problems. To improve the LMM's performance, we experiment with two techniques, RLHF (Reinforcement Learning from Human Feedback) and Image Captioning. In image captioning, we add a detailed explanation of the diagram in each image, minimizing hallucinations and image processing errors. We further explore the integration of Reinforcement Learning from Human Feedback (RLHF) methodology inspired by the ranking approach in RLHF to enhance the human-like problem-solving abilities of the models. The RLHF approach incorporates human feedback into the learning process of LLMs, improving the model's problem-solving skills, truthfulness, and reasoning capabilities, minimizing the hallucinations in the answers, and improving the quality instead of using vanilla-supervised fine-tuned models. We employ the LLaVA open-source model to answer multimodal physics MCQs and compare the performance with and without using RLHF.
- [221] arXiv:2404.13038 [ pdf , ps , html , other ]
-
Title: Mapping Social Choice Theory to RLHFSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY)
Abstract: Recent work on the limitations of using reinforcement learning from human feedback (RLHF) to incorporate human preferences into model behavior often raises social choice theory as a reference point. Social choice theory's analysis of settings such as voting mechanisms provides technical infrastructure that can inform how to aggregate human preferences amid disagreement. We analyze the problem settings of social choice and RLHF, identify key differences between them, and discuss how these differences may affect the RLHF interpretation of well-known technical results in social choice.
- [222] arXiv:2404.13150 [ pdf , ps , html , other ]
-
Title: Transformer Based Planning in the Observation Space with Applications to Trick Taking Card GamesSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Traditional search algorithms have issues when applied to games of imperfect information where the number of possible underlying states and trajectories are very large. This challenge is particularly evident in trick-taking card games. While state sampling techniques such as Perfect Information Monte Carlo (PIMC) search has shown success in these contexts, they still have major limitations.
We present Generative Observation Monte Carlo Tree Search (GO-MCTS), which utilizes MCTS on observation sequences generated by a game specific model. This method performs the search within the observation space and advances the search using a model that depends solely on the agent's observations. Additionally, we demonstrate that transformers are well-suited as the generative model in this context, and we demonstrate a process for iteratively training the transformer via population-based self-play.
The efficacy of GO-MCTS is demonstrated in various games of imperfect information, such as Hearts, Skat, and "The Crew: The Quest for Planet Nine," with promising results. - [223] arXiv:2404.13454 [ pdf , ps , other ]
-
Title: Revolutionizing System Reliability: The Role of AI in Predictive Maintenance StrategiesComments: Accepted, published and presented for the IARIA CLOUDCOMP2024 Conference of Venice, ItalyJournal-ref: In Proceedings of the IARIA CloudComputing 2024 Conference (pp. 1-9). Venice, Italy. ISSN: 2308-4294. ISBN: 978-1-68558-156-5Subjects: Artificial Intelligence (cs.AI) ; Performance (cs.PF); Systems and Control (eess.SY)
Abstract: The landscape of maintenance in distributed systems is rapidly evolving with the integration of Artificial Intelligence (AI). Also, as the complexity of computing continuum systems intensifies, the role of AI in predictive maintenance (Pd.M.) becomes increasingly pivotal. This paper presents a comprehensive survey of the current state of Pd.M. in the computing continuum, with a focus on the combination of scalable AI technologies. Recognizing the limitations of traditional maintenance practices in the face of increasingly complex and heterogenous computing continuum systems, the study explores how AI, especially machine learning and neural networks, is being used to enhance Pd.M. strategies. The survey encompasses a thorough review of existing literature, highlighting key advancements, methodologies, and case studies in the field. It critically examines the role of AI in improving prediction accuracy for system failures and in optimizing maintenance schedules, thereby contributing to reduced downtime and enhanced system longevity. By synthesizing findings from the latest advancements in the field, the article provides insights into the effectiveness and challenges of implementing AI-driven predictive maintenance. It underscores the evolution of maintenance practices in response to technological advancements and the growing complexity of computing continuum systems. The conclusions drawn from this survey are instrumental for practitioners and researchers in understanding the current landscape and future directions of Pd.M. in distributed systems. It emphasizes the need for continued research and development in this area, pointing towards a trend of more intelligent, efficient, and cost-effective maintenance solutions in the era of AI.
- [224] arXiv:2404.13501 [ pdf , ps , html , other ]
-
Title: A Survey on the Memory Mechanism of Large Language Model based AgentsZeyu Zhang , Xiaohe Bo , Chen Ma , Rui Li , Xu Chen , Quanyu Dai , Jieming Zhu , Zhenhua Dong , Ji-Rong WenComments: 39 pages, 5 figures, 4 tablesSubjects: Artificial Intelligence (cs.AI)
Abstract: Large language model (LLM) based agents have recently attracted much attention from the research and industry communities. Compared with original LLMs, LLM-based agents are featured in their self-evolving capability, which is the basis for solving real-world problems that need long-term and complex agent-environment interactions. The key component to support agent-environment interactions is the memory of the agents. While previous studies have proposed many promising memory mechanisms, they are scattered in different papers, and there lacks a systematical review to summarize and compare these works from a holistic perspective, failing to abstract common and effective designing patterns for inspiring future studies. To bridge this gap, in this paper, we propose a comprehensive survey on the memory mechanism of LLM-based agents. In specific, we first discuss ''what is'' and ''why do we need'' the memory in LLM-based agents. Then, we systematically review previous studies on how to design and evaluate the memory module. In addition, we also present many agent applications, where the memory module plays an important role. At last, we analyze the limitations of existing work and show important future directions. To keep up with the latest advances in this field, we create a repository at \url{ this https URL }.
- [225] arXiv:2404.13522 [ pdf , ps , html , other ]
-
Title: Error Analysis of Shapley Value-Based Model Explanations: An Informative PerspectiveSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract: Shapley value attribution is an increasingly popular explainable AI (XAI) method, which quantifies the contribution of each feature to the model's output. However, recent work has shown that most existing methods to implement Shapley value attributions have some drawbacks. Due to these drawbacks, the resulting Shapley value attributions may provide biased or unreliable explanations, which fail to correctly capture the true intrinsic relationships between features and model outputs. Moreover, it is difficult to evaluate these explanation errors because the true underlying dependencies between features and model outputs are typically unknown. In this paper, we theoretically analyze the explanation errors of Shapley value attributions by decomposing the explanation error into two components: observation bias and structural bias. We also clarify the underlying causes of these two biases and demonstrate that there is a trade-off between them. Based on this error analysis framework, we develop two novel concepts: over-informative and under-informative explanations. We theoretically analyze the potential over-informativeness and under-informativeness of existing Shapley value attribution methods. Particularly for the widely deployed assumption-based Shapley value attributions, we affirm that they can easily be under-informative due to the distribution drift caused by distributional assumptions. We also propose a measurement tool to quantify the distribution drift that causes such errors.
- [226] arXiv:2404.13567 [ pdf , ps , html , other ]
-
Title: On the Value of Labeled Data and Symbolic Methods for Hidden Neuron Activation AnalysisAbhilekha Dalal , Rushrukh Rayan , Adrita Barua , Eugene Y. Vasserman , Md Kamruzzaman Sarker , Pascal HitzlerSubjects: Artificial Intelligence (cs.AI)
Abstract: A major challenge in Explainable AI is in correctly interpreting activations of hidden neurons: accurate interpretations would help answer the question of what a deep learning system internally detects as relevant in the input, demystifying the otherwise black-box nature of deep learning systems. The state of the art indicates that hidden node activations can, in some cases, be interpretable in a way that makes sense to humans, but systematic automated methods that would be able to hypothesize and verify interpretations of hidden neuron activations are underexplored. This is particularly the case for approaches that can both draw explanations from substantial background knowledge, and that are based on inherently explainable (symbolic) methods.
In this paper, we introduce a novel model-agnostic post-hoc Explainable AI method demonstrating that it provides meaningful interpretations. Our approach is based on using a Wikipedia-derived concept hierarchy with approximately 2 million classes as background knowledge, and utilizes OWL-reasoning-based Concept Induction for explanation generation. Additionally, we explore and compare the capabilities of off-the-shelf pre-trained multimodal-based explainable methods.
Our results indicate that our approach can automatically attach meaningful class expressions as explanations to individual neurons in the dense layer of a Convolutional Neural Network. Evaluation through statistical analysis and degree of concept activation in the hidden layer show that our method provides a competitive edge in both quantitative and qualitative aspects compared to prior work. - [227] arXiv:2404.13721 [ pdf , ps , other ]
-
Title: The Framework of a Design Process LanguageComments: PhD dissertation, 1993, Norwegian Institute of TechnologySubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: The thesis develops a view of design in a concept formation framework and outlines a language to describe both the object of the design and the process of designing. The unknown object at the outset of the design work may be seen as an unknown concept that the designer is to define. Throughout the process, she develops a description of this object by relating it to known concepts. The search stops when the designer is satisfied that the design specification is complete enough to satisfy the requirements from it once built. It is then a collection of propositions that all contribute towards defining the design object - a collection of sentences describing relationships between the object and known concepts. Also, the design process itself may be described by relating known concepts - by organizing known abilities into particular patterns of activation, or mobilization. In view of the demands posed to a language to use in this concept formation process, the framework of a Design Process Language (DPL) is developed. The basis for the language are linguistic categories that act as classes of relations used to combine concepts, containing relations used for describing process and object within the same general system, with some relations being process specific, others being object specific, and with the bulk being used both for process and object description. Another outcome is the distinction of modal relations, or relations describing futurity, possibility, willingness, hypothetical events, and the like. The design process almost always includes aspects such as these, and it is thus necessary for a language facilitating design process description to support such relationships to be constructed. The DPL is argued to be a foundation whereupon to build a language that can be used for enabling computers to be more useful - act more intelligently - in the design process.
- [228] arXiv:2404.13778 [ pdf , ps , html , other ]
-
Title: Multi-channel Emotion Analysis for Consensus Reaching in Group Movie Recommendation SystemsComments: the paper has been submitted for consideration to IEEESubjects: Artificial Intelligence (cs.AI)
Abstract: Watching movies is one of the social activities typically done in groups. Emotion is the most vital factor that affects movie viewers' preferences. So, the emotional aspect of the movie needs to be determined and analyzed for further recommendations. It can be challenging to choose a movie that appeals to the emotions of a diverse group. Reaching an agreement for a group can be difficult due to the various genres and choices. This paper proposes a novel approach to group movie suggestions by examining emotions from three different channels: movie descriptions (text), soundtracks (audio), and posters (image). We employ the Jaccard similarity index to match each participant's emotional preferences to prospective movie choices, followed by a fuzzy inference technique to determine group consensus. We use a weighted integration process for the fusion of emotion scores from diverse data types. Then, group movie recommendation is based on prevailing emotions and viewers' best-loved movies. After determining the recommendations, the group's consensus level is calculated using a fuzzy inference system, taking participants' feedback as input. Participants (n=130) in the survey were provided with different emotion categories and asked to select the emotions best suited for particular movies (n=12). Comparison results between predicted and actual scores demonstrate the efficiency of using emotion detection for this problem (Jaccard similarity index = 0.76). We explored the relationship between induced emotions and movie popularity as an additional experiment, analyzing emotion distribution in 100 popular movies from the TMDB database. Such systems can potentially improve the accuracy of movie recommendation systems and achieve a high level of consensus among participants with diverse preferences.
- [229] arXiv:2404.13973 [ pdf , ps , html , other ]
-
Title: DEQ-MCL: Discrete-Event Queue-based Monte-Carlo LocalizationComments: Accepted by AROB-ISBC-SWARM 2024Subjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO)
Abstract: Spatial cognition in hippocampal formation is posited to play a crucial role in the development of self-localization techniques for robots. In this paper, we propose a self-localization approach, DEQ-MCL, based on the discrete event queue hypothesis associated with phase precession within the hippocampal formation. Our method effectively estimates the posterior distribution of states, encompassing both past, present, and future states that are organized as a queue. This approach enables the smoothing of the posterior distribution of past states using current observations and the weighting of the joint distribution by considering the feasibility of future states. Our findings indicate that the proposed method holds promise for augmenting self-localization performance in indoor environments.
- [230] arXiv:2404.14050 [ pdf , ps , html , other ]
-
Title: Unlawful Proxy Discrimination: A Framework for Challenging Inherently Discriminatory AlgorithmsJournal-ref: 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24)Subjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY)
Abstract: Emerging scholarship suggests that the EU legal concept of direct discrimination - where a person is given different treatment on grounds of a protected characteristic - may apply to various algorithmic decision-making contexts. This has important implications: unlike indirect discrimination, there is generally no 'objective justification' stage in the direct discrimination framework, which means that the deployment of directly discriminatory algorithms will usually be unlawful per se. In this paper, we focus on the most likely candidate for direct discrimination in the algorithmic context, termed inherent direct discrimination, where a proxy is inextricably linked to a protected characteristic. We draw on computer science literature to suggest that, in the algorithmic context, 'treatment on the grounds of' needs to be understood in terms of two steps: proxy capacity and proxy use. Only where both elements can be made out can direct discrimination be said to be `on grounds of' a protected characteristic. We analyse the legal conditions of our proposed proxy capacity and proxy use tests. Based on this analysis, we discuss technical approaches and metrics that could be developed or applied to identify inherent direct discrimination in algorithmic decision-making.
- [231] arXiv:2404.14068 [ pdf , ps , html , other ]
-
Title: Holistic Safety and Responsibility Evaluations of Advanced AI ModelsLaura Weidinger , Joslyn Barnhart , Jenny Brennan , Christina Butterfield , Susie Young , Will Hawkins , Lisa Anne Hendricks , Ramona Comanescu , Oscar Chang , Mikel Rodriguez , Jennifer Beroshi , Dawn Bloxwich , Lev Proleev , Jilin Chen , Sebastian Farquhar , Lewis Ho , Iason Gabriel , Allan Dafoe , William IsaacComments: 10 pages excluding bibliographySubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Safety and responsibility evaluations of advanced AI models are a critical but developing field of research and practice. In the development of Google DeepMind's advanced AI models, we innovated on and applied a broad set of approaches to safety evaluation. In this report, we summarise and share elements of our evolving approach as well as lessons learned for a broad audience. Key lessons learned include: First, theoretical underpinnings and frameworks are invaluable to organise the breadth of risk domains, modalities, forms, metrics, and goals. Second, theory and practice of safety evaluation development each benefit from collaboration to clarify goals, methods and challenges, and facilitate the transfer of insights between different stakeholders and disciplines. Third, similar key methods, lessons, and institutions apply across the range of concerns in responsibility and safety - including established and emerging harms. For this reason it is important that a wide range of actors working on safety evaluation and safety research communities work together to develop, refine and implement novel evaluation approaches and best practices, rather than operating in silos. The report concludes with outlining the clear need to rapidly advance the science of evaluations, to integrate new evaluations into the development and governance of AI, to establish scientifically-grounded norms and standards, and to promote a robust evaluation ecosystem.
- [232] arXiv:2404.14082 [ pdf , ps , html , other ]
-
Title: Mechanistic Interpretability for AI Safety -- A ReviewSubjects: Artificial Intelligence (cs.AI)
Abstract: Understanding AI systems' inner workings is critical for ensuring value alignment and safety. This review explores mechanistic interpretability: reverse-engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding. We establish foundational concepts such as features encoding knowledge within neural activations and hypotheses about their representation and computation. We survey methodologies for causally dissecting model behaviors and assess the relevance of mechanistic interpretability to AI safety. We investigate challenges surrounding scalability, automation, and comprehensive interpretation. We advocate for clarifying concepts, setting standards, and scaling techniques to handle complex models and behaviors and expand to domains such as vision and reinforcement learning. Mechanistic interpretability could help prevent catastrophic outcomes as AI systems become more powerful and inscrutable.
- [233] arXiv:2404.14304 [ pdf , ps , other ]
-
Title: Explaining Arguments' Strength: Unveiling the Role of Attacks and Supports (Technical Report)Comments: This paper has been accepted at IJCAI 2024 (the 33rd International Joint Conference on Artificial Intelligence)Subjects: Artificial Intelligence (cs.AI)
Abstract: Quantitatively explaining the strength of arguments under gradual semantics has recently received increasing attention. Specifically, several works in the literature provide quantitative explanations by computing the attribution scores of arguments. These works disregard the importance of attacks and supports, even though they play an essential role when explaining arguments' strength. In this paper, we propose a novel theory of Relation Attribution Explanations (RAEs), adapting Shapley values from game theory to offer fine-grained insights into the role of attacks and supports in quantitative bipolar argumentation towards obtaining the arguments' strength. We show that RAEs satisfy several desirable properties. We also propose a probabilistic algorithm to approximate RAEs efficiently. Finally, we show the application value of RAEs in fraud detection and large language models case studies.
- [234] arXiv:2404.14394 [ pdf , ps , html , other ]
-
Title: A Multimodal Automated Interpretability AgentTamar Rott Shaham , Sarah Schwettmann , Franklin Wang , Achyuta Rajaram , Evan Hernandez , Jacob Andreas , Antonio TorralbaComments: 25 pages, 13 figuresSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: This paper describes MAIA, a Multimodal Automated Interpretability Agent. MAIA is a system that uses neural models to automate neural model understanding tasks like feature interpretation and failure mode discovery. It equips a pre-trained vision-language model with a set of tools that support iterative experimentation on subcomponents of other models to explain their behavior. These include tools commonly used by human interpretability researchers: for synthesizing and editing inputs, computing maximally activating exemplars from real-world datasets, and summarizing and describing experimental results. Interpretability experiments proposed by MAIA compose these tools to describe and explain system behavior. We evaluate applications of MAIA to computer vision models. We first characterize MAIA's ability to describe (neuron-level) features in learned representations of images. Across several trained models and a novel dataset of synthetic vision neurons with paired ground-truth descriptions, MAIA produces descriptions comparable to those generated by expert human experimenters. We then show that MAIA can aid in two additional interpretability tasks: reducing sensitivity to spurious features, and automatically identifying inputs likely to be mis-classified.
- [235] arXiv:2404.14450 [ pdf , ps , html , other ]
-
Title: GraphMatcher: A Graph Representation Learning Approach for Ontology MatchingComments: The 17th International Workshop on Ontology Matching, The 21st International Semantic Web Conference (ISWC) 2022, 23 October 2022, Hangzhou, ChinaJournal-ref: The 17th International Workshop on Ontology Matching, The 21st International Semantic Web Conference (ISWC) 2022, 23 October 2022, Hangzhou, ChinaSubjects: Artificial Intelligence (cs.AI)
Abstract: Ontology matching is defined as finding a relationship or correspondence between two or more entities in two or more ontologies. To solve the interoperability problem of the domain ontologies, semantically similar entities in these ontologies must be found and aligned before merging them. GraphMatcher, developed in this study, is an ontology matching system using a graph attention approach to compute higher-level representation of a class together with its surrounding terms. The GraphMatcher has obtained remarkable results in in the Ontology Alignment Evaluation Initiative (OAEI) 2022 conference track. Its codes are available at ~\url{ this https URL }.
- [236] arXiv:2404.14786 [ pdf , ps , html , other ]
-
Title: LLM-Enhanced Causal Discovery in Temporal Domain from Interventional DataPeiwen Li , Xin Wang , Zeyang Zhang , Yuan Meng , Fang Shen , Yue Li , Jialong Wang , Yang Li , Wenweu ZhuSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Methodology (stat.ME)
Abstract: In the field of Artificial Intelligence for Information Technology Operations, causal discovery is pivotal for operation and maintenance of graph construction, facilitating downstream industrial tasks such as root cause analysis. Temporal causal discovery, as an emerging method, aims to identify temporal causal relationships between variables directly from observations by utilizing interventional data. However, existing methods mainly focus on synthetic datasets with heavy reliance on intervention targets and ignore the textual information hidden in real-world systems, failing to conduct causal discovery for real industrial scenarios. To tackle this problem, in this paper we propose to investigate temporal causal discovery in industrial scenarios, which faces two critical challenges: 1) how to discover causal relationships without the interventional targets that are costly to obtain in practice, and 2) how to discover causal relations via leveraging the textual information in systems which can be complex yet abundant in industrial contexts. To address these challenges, we propose the RealTCD framework, which is able to leverage domain knowledge to discover temporal causal relationships without interventional targets. Specifically, we first develop a score-based temporal causal discovery method capable of discovering causal relations for root cause analysis without relying on interventional targets through strategic masking and regularization. Furthermore, by employing Large Language Models (LLMs) to handle texts and integrate domain knowledge, we introduce LLM-guided meta-initialization to extract the meta-knowledge from textual information hidden in systems to boost the quality of discovery. We conduct extensive experiments on simulation and real-world datasets to show the superiority of our proposed RealTCD framework over existing baselines in discovering temporal causal structures.
- [237] arXiv:2404.15059 [ pdf , ps , other ]
-
Title: Using deep reinforcement learning to promote sustainable human behaviour on a common pool resource problemRaphael Koster , Miruna Pîslar , Andrea Tacchetti , Jan Balaguer , Leqi Liu , Romuald Elie , Oliver P. Hauser , Karl Tuyls , Matt Botvinick , Christopher SummerfieldSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
Abstract: A canonical social dilemma arises when finite resources are allocated to a group of people, who can choose to either reciprocate with interest, or keep the proceeds for themselves. What resource allocation mechanisms will encourage levels of reciprocation that sustain the commons? Here, in an iterated multiplayer trust game, we use deep reinforcement learning (RL) to design an allocation mechanism that endogenously promotes sustainable contributions from human participants to a common pool resource. We first trained neural networks to behave like human players, creating a stimulated economy that allowed us to study how different mechanisms influenced the dynamics of receipt and reciprocation. We then used RL to train a social planner to maximise aggregate return to players. The social planner discovered a redistributive policy that led to a large surplus and an inclusive economy, in which players made roughly equal gains. The RL agent increased human surplus over baseline mechanisms based on unrestricted welfare or conditional cooperation, by conditioning its generosity on available resources and temporarily sanctioning defectors by allocating fewer resources to them. Examining the AI policy allowed us to develop an explainable mechanism that performed similarly and was more popular among players. Deep reinforcement learning can be used to discover mechanisms that promote sustainable human behaviour.
- [238] arXiv:2404.15184 [ pdf , ps , html , other ]
-
Title: Reducing Human-Robot Goal State Divergence with Environment DesignComments: 8 pages, 1 figureSubjects: Artificial Intelligence (cs.AI)
Abstract: One of the most difficult challenges in creating successful human-AI collaborations is aligning a robot's behavior with a human user's expectations. When this fails to occur, a robot may misinterpret their specified goals, prompting it to perform actions with unanticipated, potentially dangerous side effects. To avoid this, we propose a new metric we call Goal State Divergence $\mathcal{(GSD)}$, which represents the difference between a robot's final goal state and the one a human user expected. In cases where $\mathcal{GSD}$ cannot be directly calculated, we show how it can be approximated using maximal and minimal bounds. We then input the $\mathcal{GSD}$ value into our novel human-robot goal alignment (HRGA) design problem, which identifies a minimal set of environment modifications that can prevent mismatches like this. To show the effectiveness of $\mathcal{GSD}$ for reducing differences between human-robot goal states, we empirically evaluate our approach on several standard benchmarks.
- [239] arXiv:2404.15189 [ pdf , ps , html , other ]
-
Title: Text2Grasp: Grasp synthesis by text prompts of object grasping partsSubjects: Artificial Intelligence (cs.AI)
Abstract: The hand plays a pivotal role in human ability to grasp and manipulate objects and controllable grasp synthesis is the key for successfully performing downstream tasks. Existing methods that use human intention or task-level language as control signals for grasping inherently face ambiguity. To address this challenge, we propose a grasp synthesis method guided by text prompts of object grasping parts, Text2Grasp, which provides more precise control. Specifically, we present a two-stage method that includes a text-guided diffusion model TextGraspDiff to first generate a coarse grasp pose, then apply a hand-object contact optimization process to ensure both plausibility and diversity. Furthermore, by leveraging Large Language Model, our method facilitates grasp synthesis guided by task-level and personalized text descriptions without additional manual annotations. Extensive experiments demonstrate that our method achieves not only accurate part-level grasp control but also comparable performance in grasp quality.
- [240] arXiv:2404.15190 [ pdf , ps , html , other ]
-
Title: Socratic Planner: Inquiry-Based Zero-Shot Planning for Embodied Instruction FollowingComments: 14 pages, 6 figuresSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract: Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in 3D environments. One of the primary challenges in EIF is compositional task planning, which is often addressed with supervised or in-context learning with labeled data. To this end, we introduce the Socratic Planner, the first zero-shot planning method that infers without the need for any training data. Socratic Planner first decomposes the instructions into substructural information of the task through self-questioning and answering, translating it into a high-level plan, i.e., a sequence of subgoals. Subgoals are executed sequentially, with our visually grounded re-planning mechanism adjusting plans dynamically through a dense visual feedback. We also introduce an evaluation metric of high-level plans, RelaxedHLP, for a more comprehensive evaluation. Experiments demonstrate the effectiveness of the Socratic Planner, achieving competitive performance on both zero-shot and few-shot task planning in the ALFRED benchmark, particularly excelling in tasks requiring higher-dimensional inference. Additionally, a precise adjustments in the plan were achieved by incorporating environmental visual information.
- [241] arXiv:2404.15192 [ pdf , ps , html , other ]
-
Title: Measuring Diversity of Game ScenariosSubjects: Artificial Intelligence (cs.AI)
Abstract: This survey comprehensively reviews the multi-dimensionality of game scenario diversity, spotlighting the innovative use of procedural content generation and other fields as cornerstones for enriching player experiences through diverse game scenarios. By traversing a wide array of disciplines, from affective modeling and multi-agent systems to psychological studies, our research underscores the importance of diverse game scenarios in gameplay and education. Through a taxonomy of diversity metrics and evaluation methods, we aim to bridge the current gaps in literature and practice, offering insights into effective strategies for measuring and integrating diversity in game scenarios. Our analysis highlights the necessity for a unified taxonomy to aid developers and researchers in crafting more engaging and varied game worlds. This survey not only charts a path for future research in diverse game scenarios but also serves as a handbook for industry practitioners seeking to leverage diversity as a key component of game design and development.
- [242] arXiv:2404.15492 [ pdf , ps , html , other ]
-
Title: Multi-scale Intervention Planning based on Generative DesignIoannis Kavouras , Ioannis Rallis , Emmanuel Sardis , Eftychios Protopapadakis , Anastasios Doulamis , Nikolaos DoulamisSubjects: Artificial Intelligence (cs.AI)
Abstract: The scarcity of green spaces, in urban environments, consists a critical challenge. There are multiple adverse effects, impacting the health and well-being of the citizens. Small scale interventions, e.g. pocket parks, is a viable solution, but comes with multiple constraints, involving the design and implementation over a specific area. In this study, we harness the capabilities of generative AI for multi-scale intervention planning, focusing on nature based solutions. By leveraging image-to-image and image inpainting algorithms, we propose a methodology to address the green space deficit in urban areas. Focusing on two alleys in Thessaloniki, where greenery is lacking, we demonstrate the efficacy of our approach in visualizing NBS interventions. Our findings underscore the transformative potential of emerging technologies in shaping the future of urban intervention planning processes.
- [243] arXiv:2404.15500 [ pdf , ps , html , other ]
-
Title: GeoLLM-Engine: A Realistic Environment for Building Geospatial CopilotsComments: Earthvision 2024, CVPR WorkshopSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Geospatial Copilots unlock unprecedented potential for performing Earth Observation (EO) applications through natural language instructions. However, existing agents rely on overly simplified single tasks and template-based prompts, creating a disconnect with real-world scenarios. In this work, we present GeoLLM-Engine, an environment for tool-augmented agents with intricate tasks routinely executed by analysts on remote sensing platforms. We enrich our environment with geospatial API tools, dynamic maps/UIs, and external multimodal knowledge bases to properly gauge an agent's proficiency in interpreting realistic high-level natural language commands and its functional correctness in task completions. By alleviating overheads typically associated with human-in-the-loop benchmark curation, we harness our massively parallel engine across 100 GPT-4-Turbo nodes, scaling to over half a million diverse multi-tool tasks and across 1.1 million satellite images. By moving beyond traditional single-task image-caption paradigms, we investigate state-of-the-art agents and prompting techniques against long-horizon prompts.
- [244] arXiv:2404.15583 [ pdf , ps , html , other ]
-
Title: Multi-Agent Reinforcement Learning for Energy Networks: Computational Challenges, Progress and Open ProblemsSubjects: Artificial Intelligence (cs.AI)
Abstract: The rapidly changing architecture and functionality of electrical networks and the increasing penetration of renewable and distributed energy resources have resulted in various technological and managerial challenges. These have rendered traditional centralized energy-market paradigms insufficient due to their inability to support the dynamic and evolving nature of the network. This survey explores how multi-agent reinforcement learning (MARL) can support the decentralization and decarbonization of energy networks and mitigate the associated challenges. This is achieved by specifying key computational challenges in managing energy networks, reviewing recent research progress on addressing them, and highlighting open challenges that may be addressed using MARL.
- [245] arXiv:2404.15822 [ pdf , ps , html , other ]
-
Title: Recursive Backwards Q-Learning in Deterministic EnvironmentsSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Reinforcement learning is a popular method of finding optimal solutions to complex problems. Algorithms like Q-learning excel at learning to solve stochastic problems without a model of their environment. However, they take longer to solve deterministic problems than is necessary. Q-learning can be improved to better solve deterministic problems by introducing such a model-based approach. This paper introduces the recursive backwards Q-learning (RBQL) agent, which explores and builds a model of the environment. After reaching a terminal state, it recursively propagates its value backwards through this model. This lets each state be evaluated to its optimal value without a lengthy learning process. In the example of finding the shortest path through a maze, this agent greatly outperforms a regular Q-learning agent.
- [246] arXiv:2404.15923 [ pdf , ps , html , other ]
-
Title: KGValidator: A Framework for Automatic Validation of Knowledge Graph ConstructionJack Boylan , Shashank Mangla , Dominic Thorn , Demian Gholipour Ghalandari , Parsa Ghaffari , Chris HokampComments: Text2KG 2024, ESWC 2024Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: This study explores the use of Large Language Models (LLMs) for automatic evaluation of knowledge graph (KG) completion models. Historically, validating information in KGs has been a challenging task, requiring large-scale human annotation at prohibitive cost. With the emergence of general-purpose generative AI and LLMs, it is now plausible that human-in-the-loop validation could be replaced by a generative agent. We introduce a framework for consistency and validation when using generative models to validate knowledge graphs. Our framework is based upon recent open-source developments for structural and semantic validation of LLM outputs, and upon flexible approaches to fact checking and verification, supported by the capacity to reference external knowledge sources of any kind. The design is easy to adapt and extend, and can be used to verify any kind of graph-structured data through a combination of model-intrinsic knowledge, user-supplied context, and agents capable of external knowledge retrieval.
- [247] arXiv:2404.16055 [ pdf , ps , html , other ]
-
Title: Assessing Climate Transition Risks in the Colombian Processed Food Sector: A Fuzzy Logic and Multicriteria Decision-Making ApproachJuan F. Pérez-Pérez , Pablo Isaza Gómez , Isis Bonet , María Solange Sánchez-Pinzón , Fabio Caraffini , Christian LochmullerSubjects: Artificial Intelligence (cs.AI)
Abstract: Climate risk assessment is becoming increasingly important. For organisations, identifying and assessing climate-related risks is challenging, as they can come from multiple sources. This study identifies and assesses the main climate transition risks in the colombian processed food sector. As transition risks are vague, our approach uses Fuzzy Logic and compares it to various multi-criteria decision-making methods to classify the different climate transition risks an organisation may be exposed to. This approach allows us to use linguistic expressions for risk analysis and to better describe risks and their consequences. The results show that the risks ranked as the most critical for this organisation in their order were price volatility and raw materials availability, the change to less carbon-intensive production or consumption patterns, the increase in carbon taxes and technological change, and the associated development or implementation costs. These risks show a critical risk level, which implies that they are the most significant risks for the organisation in the case study. These results highlight the importance of investments needed to meet regulatory requirements, which are the main drivers for organisations at the financial level.
- [248] arXiv:2404.16068 [ pdf , ps , html , other ]
-
Title: SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common SenseSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: While vertical thinking relies on logical and commonsense reasoning, lateral thinking requires systems to defy commonsense associations and overwrite them through unconventional thinking. Lateral thinking has been shown to be challenging for current models but has received little attention. A recent benchmark, BRAINTEASER, aims to evaluate current models' lateral thinking ability in a zero-shot setting. In this paper, we split the original benchmark to also support fine-tuning setting and present SemEval Task 9: BRAIN-TEASER(S), the first task at this competition designed to test the system's reasoning and lateral thinking ability. As a popular task, BRAINTEASER(S)'s two subtasks receive 483 team submissions from 182 participants during the competition. This paper provides a fine-grained system analysis of the competition results, together with a reflection on what this means for the ability of the systems to reason laterally. We hope that the BRAINTEASER(S) subtasks and findings in this paper can stimulate future work on lateral thinking and robust reasoning by computational models.
- [249] arXiv:2404.16072 [ pdf , ps , html , other ]
-
Title: Playing Board Games with the Predict Results of Beam Search AlgorithmComments: 8 pages, 4 figuresSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: This paper introduces a novel algorithm for two-player deterministic games with perfect information, which we call PROBS (Predict Results of Beam Search). Unlike existing methods that predominantly rely on Monte Carlo Tree Search (MCTS) for decision processes, our approach leverages a simpler beam search algorithm. We evaluate the performance of our algorithm across a selection of board games, where it consistently demonstrates an increased winning ratio against baseline opponents. A key result of this study is that the PROBS algorithm operates effectively, even when the beam search size is considerably smaller than the average number of turns in the game.
- [250] arXiv:2404.16078 [ pdf , ps , html , other ]
-
Title: Learning World Models With Hierarchical Temporal Abstractions: A Probabilistic PerspectiveComments: Doctoral Dissertation Preprint, Department of Computer Science, Karlsruhe Institute Of Technology, 2024Subjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Machines that can replicate human intelligence with type 2 reasoning capabilities should be able to reason at multiple levels of spatio-temporal abstractions and scales using internal world models. Devising formalisms to develop such internal world models, which accurately reflect the causal hierarchies inherent in the dynamics of the real world, is a critical research challenge in the domains of artificial intelligence and machine learning. This thesis identifies several limitations with the prevalent use of state space models (SSMs) as internal world models and propose two new probabilistic formalisms namely Hidden-Parameter SSMs and Multi-Time Scale SSMs to address these drawbacks. The structure of graphical models in both formalisms facilitates scalable exact probabilistic inference using belief propagation, as well as end-to-end learning via backpropagation through time. This approach permits the development of scalable, adaptive hierarchical world models capable of representing nonstationary dynamics across multiple temporal abstractions and scales. Moreover, these probabilistic formalisms integrate the concept of uncertainty in world states, thus improving the system's capacity to emulate the stochastic nature of the real world and quantify the confidence in its predictions. The thesis also discuss how these formalisms are in line with related neuroscience literature on Bayesian brain hypothesis and predicitive processing. Our experiments on various real and simulated robots demonstrate that our formalisms can match and in many cases exceed the performance of contemporary transformer variants in making long-range future predictions. We conclude the thesis by reflecting on the limitations of our current models and suggesting directions for future research.
- [251] arXiv:2404.16206 [ pdf , ps , html , other ]
-
Title: Knowledge Graph Completion using Structural and Textual EmbeddingsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Knowledge Graphs (KGs) are widely employed in artificial intelligence applications, such as question-answering and recommendation systems. However, KGs are frequently found to be incomplete. While much of the existing literature focuses on predicting missing nodes for given incomplete KG triples, there remains an opportunity to complete KGs by exploring relations between existing nodes, a task known as relation prediction. In this study, we propose a relations prediction model that harnesses both textual and structural information within KGs. Our approach integrates walks-based embeddings with language model embeddings to effectively represent nodes. We demonstrate that our model achieves competitive results in the relation prediction task when evaluated on a widely used dataset.
- [252] arXiv:2404.16364 [ pdf , ps , html , other ]
-
Title: ReZero: Boosting MCTS-based Algorithms by Just-in-Time and Speedy ReanalyzeSubjects: Artificial Intelligence (cs.AI)
Abstract: MCTS-based algorithms, such as MuZero and its derivatives, have achieved widespread success in various decision-making domains. These algorithms employ the reanalyze process to enhance sample efficiency, albeit at the expense of significant wall-clock time consumption. To address this issue, we propose a general approach named ReZero to boost MCTS-based algorithms. Specifically, we propose a new scheme that simplifies data collecting and reanalyzing, which significantly reduces the search cost while guarantees the performance as well. Furthermore, to accelerate each search process, we conceive a method to reuse the subsequent information in the trajectory. The corresponding analysis conducted on the bandit model also provides auxiliary theoretical substantiation for our design. Experiments conducted on Atari environments and board games demonstrates that ReZero substantially improves training speed while maintaining high sample efficiency. The code is available as part of the LightZero benchmark at this https URL .
- [253] arXiv:2404.16379 [ pdf , ps , html , other ]
-
Title: Optimal and Bounded Suboptimal Any-Angle Multi-agent PathfindingSubjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA); Robotics (cs.RO)
Abstract: Multi-agent pathfinding (MAPF) is the problem of finding a set of conflict-free paths for a set of agents. Typically, the agents' moves are limited to a pre-defined graph of possible locations and allowed transitions between them, e.g. a 4-neighborhood grid. We explore how to solve MAPF problems when each agent can move between any pair of possible locations as long as traversing the line segment connecting them does not lead to the collision with the obstacles. This is known as any-angle pathfinding. We present the first optimal any-angle multi-agent pathfinding algorithm. Our planner is based on the Continuous Conflict-based Search (CCBS) algorithm and an optimal any-angle variant of the Safe Interval Path Planning (TO-AA-SIPP). The straightforward combination of those, however, scales poorly since any-angle path finding induces search trees with a very large branching factor. To mitigate this, we adapt two techniques from classical MAPF to the any-angle setting, namely Disjoint Splitting and Multi-Constraints. Experimental results on different combinations of these techniques show they enable solving over 30% more problems than the vanilla combination of CCBS and TO-AA-SIPP. In addition, we present a bounded-suboptimal variant of our algorithm, that enables trading runtime for solution cost in a controlled manner.
- [254] arXiv:2404.16411 [ pdf , ps , html , other ]
-
Title: Label-Free Topic-Focused Summarization Using Query AugmentationSubjects: Artificial Intelligence (cs.AI)
Abstract: In today's data and information-rich world, summarization techniques are essential in harnessing vast text to extract key information and enhance decision-making and efficiency. In particular, topic-focused summarization is important due to its ability to tailor content to specific aspects of an extended text. However, this usually requires extensive labelled datasets and considerable computational power. This study introduces a novel method, Augmented-Query Summarization (AQS), for topic-focused summarization without the need for extensive labelled datasets, leveraging query augmentation and hierarchical clustering. This approach facilitates the transferability of machine learning models to the task of summarization, circumventing the need for topic-specific training. Through real-world tests, our method demonstrates the ability to generate relevant and accurate summaries, showing its potential as a cost-effective solution in data-rich environments. This innovation paves the way for broader application and accessibility in the field of topic-focused summarization technology, offering a scalable, efficient method for personalized content extraction.
- [255] arXiv:2404.16534 [ pdf , ps , html , other ]
-
Title: SIDEs: Separating Idealization from Deceptive Explanations in xAIComments: 18 pages, 3 figures, 2 tables Forthcoming in FAccT'24Subjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY)
Abstract: Explainable AI (xAI) methods are important for establishing trust in using black-box models. However, recent criticism has mounted against current xAI methods that they disagree, are necessarily false, and can be manipulated, which has started to undermine the deployment of black-box models. Rudin (2019) goes so far as to say that we should stop using black-box models altogether in high-stakes cases because xAI explanations "must be wrong". However, strict fidelity to the truth is historically not a desideratum in science. Idealizations -- the intentional distortions introduced to scientific theories and models -- are commonplace in the natural sciences and are seen as a successful scientific tool. Thus, it is not falsehood qua falsehood that is the issue. In this paper, I outline the need for xAI research to engage in idealization evaluation. Drawing on the use of idealizations in the natural sciences and philosophy of science, I introduce a novel framework for evaluating whether xAI methods engage in successful idealizations or deceptive explanations (SIDEs). SIDEs evaluates whether the limitations of xAI methods, and the distortions that they introduce, can be part of a successful idealization or are indeed deceptive distortions as critics suggest. I discuss the role that existing research can play in idealization evaluation and where innovation is necessary. Through a qualitative analysis we find that leading feature importance methods and counterfactual explanations are subject to idealization failure and suggest remedies for ameliorating idealization failure.
- [256] arXiv:2404.16579 [ pdf , ps , html , other ]
-
Title: Neural Interaction Energy for Multi-Agent Trajectory PredictionSubjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO)
Abstract: Maintaining temporal stability is crucial in multi-agent trajectory prediction. Insufficient regularization to uphold this stability often results in fluctuations in kinematic states, leading to inconsistent predictions and the amplification of errors. In this study, we introduce a framework called Multi-Agent Trajectory prediction via neural interaction Energy (MATE). This framework assesses the interactive motion of agents by employing neural interaction energy, which captures the dynamics of interactions and illustrates their influence on the future trajectories of agents. To bolster temporal stability, we introduce two constraints: inter-agent interaction constraint and intra-agent motion constraint. These constraints work together to ensure temporal stability at both the system and agent levels, effectively mitigating prediction fluctuations inherent in multi-agent systems. Comparative evaluations against previous methods on four diverse datasets highlight the superior prediction accuracy and generalization capabilities of our model.
- [257] arXiv:2404.16689 [ pdf , ps , html , other ]
-
Title: Learning to Beat ByteRL: Exploitability of Collectible Card Game AgentsSubjects: Artificial Intelligence (cs.AI)
Abstract: While Poker, as a family of games, has been studied extensively in the last decades, collectible card games have seen relatively little attention. Only recently have we seen an agent that can compete with professional human players in Hearthstone, one of the most popular collectible card games. Although artificial agents must be able to work with imperfect information in both of these genres, collectible card games pose another set of distinct challenges. Unlike in many poker variants, agents must deal with state space so vast that even enumerating all states consistent with the agent's beliefs is intractable, rendering the current search methods unusable and requiring the agents to opt for other techniques. In this paper, we investigate the strength of such techniques for this class of games. Namely, we present preliminary analysis results of ByteRL, the state-of-the-art agent in Legends of Code and Magic and Hearthstone. Although ByteRL beat a top-10 Hearthstone player from China, we show that its play in Legends of Code and Magic is highly exploitable.
- [258] arXiv:2404.16721 [ pdf , ps , html , other ]
-
Title: Distilling Privileged Information for Dubins Traveling Salesman Problems with NeighborhoodsComments: 7 pages, 4 figures, double blind under reviewSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: This paper presents a novel learning approach for Dubins Traveling Salesman Problems(DTSP) with Neighborhood (DTSPN) to quickly produce a tour of a non-holonomic vehicle passing through neighborhoods of given task points. The method involves two learning phases: initially, a model-free reinforcement learning approach leverages privileged information to distill knowledge from expert trajectories generated by the LinKernighan heuristic (LKH) algorithm. Subsequently, a supervised learning phase trains an adaptation network to solve problems independently of privileged information. Before the first learning phase, a parameter initialization technique using the demonstration data was also devised to enhance training efficiency. The proposed learning method produces a solution about 50 times faster than LKH and substantially outperforms other imitation learning and RL with demonstration schemes, most of which fail to sense all the task points.
- [259] arXiv:2404.16957 [ pdf , ps , html , other ]
-
Title: Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for AccountabilitySubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY)
Abstract: The pervasive integration of Artificial Intelligence (AI) has introduced complex challenges in the responsibility and accountability in the event of incidents involving AI-enabled systems. The interconnectivity of these systems, ethical concerns of AI-induced incidents, coupled with uncertainties in AI technology and the absence of corresponding regulations, have made traditional responsibility attribution challenging. To this end, this work proposes a Computational Reflective Equilibrium (CRE) approach to establish a coherent and ethically acceptable responsibility attribution framework for all stakeholders. The computational approach provides a structured analysis that overcomes the limitations of conceptual approaches in dealing with dynamic and multifaceted scenarios, showcasing the framework's explainability, coherence, and adaptivity properties in the responsibility attribution process. We examine the pivotal role of the initial activation level associated with claims in equilibrium computation. Using an AI-assisted medical decision-support system as a case study, we illustrate how different initializations lead to diverse responsibility distributions. The framework offers valuable insights into accountability in AI-induced incidents, facilitating the development of a sustainable and resilient system through continuous monitoring, revision, and reflection.
- [260] arXiv:2404.17053 [ pdf , ps , html , other ]
-
Title: Agentive Permissions in Multiagent SystemsComments: The 33rd International Joint Conference on Artificial Intelligence (IJCAI-24)Subjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA)
Abstract: This paper proposes to distinguish four forms of agentive permissions in multiagent settings. The main technical results are the complexity analysis of model checking, the semantic undefinability of modalities that capture these forms of permissions through each other, and a complete logical system capturing the interplay between these modalities.
- [261] arXiv:2404.17129 [ pdf , ps , other ]
-
Title: Process Mining Embeddings: Learning Vector Representations for Petri NetsSubjects: Artificial Intelligence (cs.AI)
Abstract: Process mining offers powerful techniques for discovering, analyzing, and enhancing real-world business processes. In this context, Petri nets provide an expressive means of modeling process behavior. However, directly analyzing and comparing intricate Petri net presents challenges. This study introduces PetriNet2Vec, a novel unsupervised methodology based on Natural Language Processing concepts inspired by Doc2Vec and designed to facilitate the effective comparison, clustering, and classification of process models represented as embedding vectors. These embedding vectors allow us to quantify similarities and relationships between different process models. Our methodology was experimentally validated using the PDC Dataset, featuring 96 diverse Petri net models. We performed cluster analysis, created UMAP visualizations, and trained a decision tree to provide compelling evidence for the capability of PetriNet2Vec to discern meaningful patterns and relationships among process models and their constituent tasks. Through a series of experiments, we demonstrated that PetriNet2Vec was capable of learning the structure of Petri nets, as well as the main properties used to simulate the process models of our dataset. Furthermore, our results showcase the utility of the learned embeddings in two crucial downstream tasks within process mining enhancement: process classification and process retrieval.
- [262] arXiv:2404.17316 [ pdf , ps , html , other ]
-
Title: Certified MaxSAT PreprocessingSubjects: Artificial Intelligence (cs.AI)
Abstract: Building on the progress in Boolean satisfiability (SAT) solving over the last decades, maximum satisfiability (MaxSAT) has become a viable approach for solving NP-hard optimization problems, but ensuring correctness of MaxSAT solvers has remained an important concern. For SAT, this is largely a solved problem thanks to the use of proof logging, meaning that solvers emit machine-verifiable proofs of (un)satisfiability to certify correctness. However, for MaxSAT, proof logging solvers have started being developed only very recently. Moreover, these nascent efforts have only targeted the core solving process, ignoring the preprocessing phase where input problem instances can be substantially reformulated before being passed on to the solver proper. In this work, we demonstrate how pseudo-Boolean proof logging can be used to certify the correctness of a wide range of modern MaxSAT preprocessing techniques. By combining and extending the VeriPB and CakePB tools, we provide formally verified, end-to-end proof checking that the input and preprocessed output MaxSAT problem instances have the same optimal value. An extensive evaluation on applied MaxSAT benchmarks shows that our approach is feasible in practice.
- [263] arXiv:2404.17524 [ pdf , ps , html , other ]
-
Title: On the Use of Large Language Models to Generate Capability OntologiesSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Capability ontologies are increasingly used to model functionalities of systems or machines. The creation of such ontological models with all properties and constraints of capabilities is very complex and can only be done by ontology experts. However, Large Language Models (LLMs) have shown that they can generate machine-interpretable models from natural language text input and thus support engineers / ontology experts. Therefore, this paper investigates how LLMs can be used to create capability ontologies. We present a study with a series of experiments in which capabilities with varying complexities are generated using different prompting techniques and with different LLMs. Errors in the generated ontologies are recorded and compared. To analyze the quality of the generated ontologies, a semi-automated approach based on RDF syntax checking, OWL reasoning, and SHACL constraints is used. The results of this study are very promising because even for complex capabilities, the generated ontologies are almost free of errors.
- [264] arXiv:2404.17648 [ pdf , ps , html , other ]
-
Title: Consolidating LAMA with Best-First Width SearchSubjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO)
Abstract: One key decision for heuristic search algorithms is how to balance exploration and exploitation. In classical planning, novelty search has come out as the most successful approach in this respect. The idea is to favor states that contain previously unseen facts when searching for a plan. This is done by maintaining a record of the tuples of facts observed in previous states. Then the novelty of a state is the size of the smallest previously unseen tuple. The most successful version of novelty search is best-first width search (BFWS), which combines novelty measures with heuristic estimates. An orthogonal approach to balance exploration-exploitation is to use several open-lists. These open-lists are ordered using different heuristic estimates, which diversify the information used in the search. The search algorithm then alternates between these open-lists, trying to exploit these different estimates. This is the approach used by LAMA, a classical planner that, a decade after its release, is still considered state-of-the-art in agile planning. In this paper, we study how to combine LAMA and BFWS. We show that simply adding the strongest open-list used in BFWS to LAMA harms performance. However, we show that combining only parts of each planner leads to a new state-of-the-art agile planner.
- [265] arXiv:2404.17716 [ pdf , ps , other ]
-
Title: Airlift Challenge: A Competition for Optimizing Cargo DeliverySubjects: Artificial Intelligence (cs.AI)
Abstract: Airlift operations require the timely distribution of various cargo, much of which is time sensitive and valuable. However, these operations have to contend with sudden disruptions from weather and malfunctions, requiring immediate rescheduling. The Airlift Challenge competition seeks possible solutions via a simulator that provides a simplified abstraction of the airlift problem. The simulator uses an OpenAI gym interface that allows participants to create an algorithm for planning agent actions. The algorithm is scored using a remote evaluator against scenarios of ever-increasing difficulty. The second iteration of the competition was underway from November 2023 to April 2024. In this paper, we describe the competition and simulation environment. As a step towards applying generalized planning techniques to the problem, we present a temporal PDDL domain for the Pickup and Delivery Problem, a model which lies at the core of the Airlift Challenge.
- [266] arXiv:2404.17725 [ pdf , ps , html , other ]
-
Title: Boltzmann State-Dependent RationalitySubjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO)
Abstract: This paper expands on existing learned models of human behavior via a measured step in structured irrationality. Specifically, by replacing the suboptimality constant $\beta$ in a Boltzmann rationality model with a function over states $\beta(s)$, we gain natural expressivity in a computationally tractable manner. This paper discusses relevant mathematical theory, sets up several experimental designs, presents limited preliminary results, and proposes future investigations.
- [267] arXiv:2404.17749 [ pdf , ps , html , other ]
-
Title: UMass-BioNLP at MEDIQA-M3G 2024: DermPrompt -- A Systematic Exploration of Prompt Engineering with GPT-4V for Dermatological DiagnosisParth Vashisht , Abhilasha Lodha , Mukta Maddipatla , Zonghai Yao , Avijit Mitra , Zhichao Yang , Junda Wang , Sunjae Kwon , Hong YuComments: Accepted at NAACL-ClinicalNLP workshop 2024Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: This paper presents our team's participation in the MEDIQA-ClinicalNLP2024 shared task B. We present a novel approach to diagnosing clinical dermatology cases by integrating large multimodal models, specifically leveraging the capabilities of GPT-4V under a retriever and a re-ranker framework. Our investigation reveals that GPT-4V, when used as a retrieval agent, can accurately retrieve the correct skin condition 85% of the time using dermatological images and brief patient histories. Additionally, we empirically show that Naive Chain-of-Thought (CoT) works well for retrieval while Medical Guidelines Grounded CoT is required for accurate dermatological diagnosis. Further, we introduce a Multi-Agent Conversation (MAC) framework and show its superior performance and potential over the best CoT strategy. The experiments suggest that using naive CoT for retrieval and multi-agent conversation for critique-based diagnosis, GPT-4V can lead to an early and accurate diagnosis of dermatological conditions. The implications of this work extend to improving diagnostic workflows, supporting dermatological education, and enhancing patient care by providing a scalable, accessible, and accurate diagnostic tool.
- [268] arXiv:2404.17757 [ pdf , ps , other ]
-
Title: Middle Architecture CriteriaComments: 14 pagesSubjects: Artificial Intelligence (cs.AI) ; Databases (cs.DB); Logic in Computer Science (cs.LO)
Abstract: Mid-level ontologies are used to integrate terminologies and data across disparate domains. There are, however, no clear, defensible criteria for determining whether a given ontology should count as mid-level, because we lack a rigorous characterization of what the middle level of generality is supposed to contain. Attempts to provide such a characterization have failed, we believe, because they have focused on the goal of specifying what is characteristic of those single ontologies that have been advanced as mid-level ontologies. Unfortunately, single ontologies of this sort are generally a mixture of top- and mid-level, and sometimes even of domain-level terms. To gain clarity, we aim to specify the necessary and sufficient conditions for a collection of one or more ontologies to inhabit what we call a mid-level architecture.
- [269] arXiv:2404.17758 [ pdf , ps , other ]
-
Title: The Common Core OntologiesComments: 13 pagesSubjects: Artificial Intelligence (cs.AI) ; Databases (cs.DB); Logic in Computer Science (cs.LO)
Abstract: The Common Core Ontologies (CCO) are designed as a mid-level ontology suite that extends the Basic Formal Ontology. CCO has since been increasingly adopted by a broad group of users and applications and is proposed as the first standard mid-level ontology. Despite these successes, documentation of the contents and design patterns of the CCO has been comparatively minimal. This paper is a step toward providing enhanced documentation for the mid-level ontology suite through a discussion of the contents of the eleven ontologies that collectively comprise the Common Core Ontology suite.
- [270] arXiv:2404.17833 [ pdf , ps , html , other ]
-
Title: Testing and Understanding Erroneous Planning in LLM Agents through Synthesized User InputsSubjects: Artificial Intelligence (cs.AI) ; Programming Languages (cs.PL)
Abstract: Agents based on large language models (LLMs) have demonstrated effectiveness in solving a wide range of tasks by integrating LLMs with key modules such as planning, memory, and tool usage. Increasingly, customers are adopting LLM agents across a variety of commercial applications critical to reliability, including support for mental well-being, chemical synthesis, and software development. Nevertheless, our observations and daily use of LLM agents indicate that they are prone to making erroneous plans, especially when the tasks are complex and require long-term planning.
In this paper, we propose PDoctor, a novel and automated approach to testing LLM agents and understanding their erroneous planning. As the first work in this direction, we formulate the detection of erroneous planning as a constraint satisfiability problem: an LLM agent's plan is considered erroneous if its execution violates the constraints derived from the user inputs. To this end, PDoctor first defines a domain-specific language (DSL) for user queries and synthesizes varying inputs with the assistance of the Z3 constraint solver. These synthesized inputs are natural language paragraphs that specify the requirements for completing a series of tasks. Then, PDoctor derives constraints from these requirements to form a testing oracle. We evaluate PDoctor with three mainstream agent frameworks and two powerful LLMs (GPT-3.5 and GPT-4). The results show that PDoctor can effectively detect diverse errors in agent planning and provide insights and error characteristics that are valuable to both agent developers and users. We conclude by discussing potential alternative designs and directions to extend PDoctor. - [271] arXiv:2404.17924 [ pdf , ps , html , other ]
-
Title: Results about sets of desirable gamble setsSubjects: Artificial Intelligence (cs.AI) ; Computer Science and Game Theory (cs.GT)
Abstract: Coherent sets of desirable gamble sets is used as a model for representing an agents opinions and choice preferences under uncertainty. In this paper we provide some results about the axioms required for coherence and the natural extension of a given set of desirable gamble sets. We also show that coherent sets of desirable gamble sets can be represented by a proper filter of coherent sets of desirable gambles.
- [272] arXiv:2404.17977 [ pdf , ps , html , other ]
-
Title: Advancing Healthcare Automation: Multi-Agent Systems for Medical Necessity JustificationComments: 5 pages, 4 figures. Work In ProgressSubjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA)
Abstract: This paper explores the application of Swarm-Structured Multi-Agent Systems (MAS) to establish medical necessity, a process that involves a systematic review of patient-specific medical structured and unstructured data against clinical guidelines. We addressed this complex task by decomposing it into smaller, more manageable sub-tasks. Each sub-task is handled by a specialized AI agent. We conduct a systematic study of the impact of various prompting strategies on these agents and benchmark different Large Language Models (LLMs) to determine their accuracy in completing these tasks. Additionally, we investigate how these agents can provide explainability, thereby enhancing trust and transparency within the system.
- [273] arXiv:2404.18021 [ pdf , ps , html , other ]
-
Title: CRISPR-GPT: An LLM Agent for Automated Design of Gene-Editing ExperimentsKaixuan Huang , Yuanhao Qu , Henry Cousins , William A. Johnson , Di Yin , Mihir Shah , Denny Zhou , Russ Altman , Mengdi Wang , Le CongSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Quantitative Methods (q-bio.QM)
Abstract: The introduction of genome engineering technology has transformed biomedical research, making it possible to make precise changes to genetic information. However, creating an efficient gene-editing system requires a deep understanding of CRISPR technology, and the complex experimental systems under investigation. While Large Language Models (LLMs) have shown promise in various tasks, they often lack specific knowledge and struggle to accurately solve biological design problems. In this work, we introduce CRISPR-GPT, an LLM agent augmented with domain knowledge and external tools to automate and enhance the design process of CRISPR-based gene-editing experiments. CRISPR-GPT leverages the reasoning ability of LLMs to facilitate the process of selecting CRISPR systems, designing guide RNAs, recommending cellular delivery methods, drafting protocols, and designing validation experiments to confirm editing outcomes. We showcase the potential of CRISPR-GPT for assisting non-expert researchers with gene-editing experiments from scratch and validate the agent's effectiveness in a real-world use case. Furthermore, we explore the ethical and regulatory considerations associated with automated gene-editing design, highlighting the need for responsible and transparent use of these tools. Our work aims to bridge the gap between beginner biological researchers and CRISPR genome engineering techniques, and demonstrate the potential of LLM agents in facilitating complex biological discovery tasks.
- [274] arXiv:2404.18074 [ pdf , ps , html , other ]
-
Title: MMAC-Copilot: Multi-modal Agent Collaboration Operating System CopilotComments: In processingSubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC)
Abstract: Autonomous virtual agents are often limited by their singular mode of interaction with real-world environments, restricting their versatility. To address this, we propose the Multi-Modal Agent Collaboration framework (MMAC-Copilot), a framework utilizes the collective expertise of diverse agents to enhance interaction ability with operating systems. The framework introduces a team collaboration chain, enabling each participating agent to contribute insights based on their specific domain knowledge, effectively reducing the hallucination associated with knowledge domain gaps. To evaluate the performance of MMAC-Copilot, we conducted experiments using both the GAIA benchmark and our newly introduced Visual Interaction Benchmark (VIBench). VIBench focuses on non-API-interactable applications across various domains, including 3D gaming, recreation, and office scenarios. MMAC-Copilot achieved exceptional performance on GAIA, with an average improvement of 6.8\% over existing leading systems. Furthermore, it demonstrated remarkable capability on VIBench, particularly in managing various methods of interaction within systems and applications. These results underscore MMAC-Copilot's potential in advancing the field of autonomous virtual agents through its innovative approach to agent collaboration.
- [275] arXiv:2404.18130 [ pdf , ps , html , other ]
-
Title: Logic Agent: Enhancing Validity with Logic Rule InvocationSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Chain-of-Thought (CoT) prompting has emerged as a pivotal technique for augmenting the inferential capabilities of language models during reasoning tasks. Despite its advancements, CoT often grapples with challenges in validating reasoning validity and ensuring informativeness. Addressing these limitations, this paper introduces the Logic Agent (LA), an agent-based framework aimed at enhancing the validity of reasoning processes in Large Language Models (LLMs) through strategic logic rule invocation. Unlike conventional approaches, LA transforms LLMs into logic agents that dynamically apply propositional logic rules, initiating the reasoning process by converting natural language inputs into structured logic forms. The logic agent leverages a comprehensive set of predefined functions to systematically navigate the reasoning process. This methodology not only promotes the structured and coherent generation of reasoning constructs but also significantly improves their interpretability and logical coherence. Through extensive experimentation, we demonstrate LA's capacity to scale effectively across various model sizes, markedly improving the precision of complex reasoning across diverse tasks.
- [276] arXiv:2404.18202 [ pdf , ps , html , other ]
-
Title: WorldGPT: Empowering LLM as Multimodal World ModelSubjects: Artificial Intelligence (cs.AI) ; Multimedia (cs.MM)
Abstract: World models are progressively being employed across diverse fields, extending from basic environment simulation to complex scenario construction. However, existing models are mainly trained on domain-specific states and actions, and confined to single-modality state representations. In this paper, We introduce WorldGPT, a generalist world model built upon Multimodal Large Language Model (MLLM). WorldGPT acquires an understanding of world dynamics through analyzing millions of videos across various domains. To further enhance WorldGPT's capability in specialized scenarios and long-term tasks, we have integrated it with a novel cognitive architecture that combines memory offloading, knowledge retrieval, and context reflection. As for evaluation, we build WorldNet, a multimodal state transition prediction benchmark encompassing varied real-life scenarios. Conducting evaluations on WorldNet directly demonstrates WorldGPT's capability to accurately model state transition patterns, affirming its effectiveness in understanding and predicting the dynamics of complex scenarios. We further explore WorldGPT's emerging potential in serving as a world simulator, helping multimodal agents generalize to unfamiliar domains through efficiently synthesising multimodal instruction instances which are proved to be as reliable as authentic data for fine-tuning purposes. The project is available on \url{ this https URL }.
- [277] arXiv:2404.18262 [ pdf , ps , html , other ]
-
Title: Generating Situated Reflection Triggers about Alternative Solution Paths: A Case Study of Generative AI for Computer-Supported Collaborative LearningAtharva Naik , Jessica Ruhan Yin , Anusha Kamath , Qianou Ma , Sherry Tongshuang Wu , Charles Murray , Christopher Bogart , Majd Sakr , Carolyn P. RoseSubjects: Artificial Intelligence (cs.AI)
Abstract: An advantage of Large Language Models (LLMs) is their contextualization capability - providing different responses based on student inputs like solution strategy or prior discussion, to potentially better engage students than standard feedback. We present a design and evaluation of a proof-of-concept LLM application to offer students dynamic and contextualized feedback. Specifically, we augment an Online Programming Exercise bot for a college-level Cloud Computing course with ChatGPT, which offers students contextualized reflection triggers during a collaborative query optimization task in database design. We demonstrate that LLMs can be used to generate highly situated reflection triggers that incorporate details of the collaborative discussion happening in context. We discuss in depth the exploration of the design space of the triggers and their correspondence with the learning objectives as well as the impact on student learning in a pilot study with 34 students.
- [278] arXiv:2404.18270 [ pdf , ps , html , other ]
-
Title: Pragmatic Formal Verification of Sequential Error Detection and Correction Codes (ECCs) used in Safety-Critical DesignComments: Published in DVCon U.S. 2023Subjects: Artificial Intelligence (cs.AI) ; Logic in Computer Science (cs.LO)
Abstract: Error Detection and Correction Codes (ECCs) are often used in digital designs to protect data integrity. Especially in safety-critical systems such as automotive electronics, ECCs are widely used and the verification of such complex logic becomes more critical considering the ISO 26262 safety standards. Exhaustive verification of ECC using formal methods has been a challenge given the high number of data bits to protect. As an example, for an ECC of 128 data bits with a possibility to detect up to four-bit errors, the combination of bit errors is given by 128C1 + 128C2 + 128C3 + 128C4 = 1.1 * 10^7. This vast analysis space often leads to bounded proof results. Moreover, the complexity and state-space increase further if the ECC has sequential encoding and decoding stages. To overcome such problems and sign-off the design with confidence within reasonable proof time, we present a pragmatic formal verification approach of complex ECC cores with several complexity reduction techniques and know-how that were learnt during the course of verification. We discuss using the linearity of the syndrome generator as a helper assertion, using the abstract model as glue logic to compare the RTL with the sequential version of the circuit, k-induction-based model checking and using mathematical relations captured as properties to simplify the verification in order to get an unbounded proof result within 24 hours of proof runtime.
- [279] arXiv:2404.18296 [ pdf , ps , other ]
-
Title: Using Deep Q-Learning to Dynamically Toggle between Push/Pull Actions in Computational Trust MechanismsSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: Recent work on decentralized computational trust models for open Multi Agent Systems has resulted in the development of CA, a biologically inspired model which focuses on the trustee's perspective. This new model addresses a serious unresolved problem in existing trust and reputation models, namely the inability to handle constantly changing behaviors and agents' continuous entry and exit from the system. In previous work, we compared CA to FIRE, a well-known trust and reputation model, and found that CA is superior when the trustor population changes, whereas FIRE is more resilient to the trustee population changes. Thus, in this paper, we investigate how the trustors can detect the presence of several dynamic factors in their environment and then decide which trust model to employ in order to maximize utility. We frame this problem as a machine learning problem in a partially observable environment, where the presence of several dynamic factors is not known to the trustor and we describe how an adaptable trustor can rely on a few measurable features so as to assess the current state of the environment and then use Deep Q Learning (DQN), in a single-agent Reinforcement Learning setting, to learn how to adapt to a changing environment. We ran a series of simulation experiments to compare the performance of the adaptable trustor with the performance of trustors using only one model (FIRE or CA) and we show that an adaptable agent is indeed capable of learning when to use each model and, thus, perform consistently in dynamic environments.
- [280] arXiv:2404.18416 [ pdf , ps , html , other ]
-
Title: Capabilities of Gemini Models in MedicineKhaled Saab , Tao Tu , Wei-Hung Weng , Ryutaro Tanno , David Stutz , Ellery Wulczyn , Fan Zhang , Tim Strother , Chunjong Park , Elahe Vedadi , Juanma Zambrano Chaves , Szu-Yeu Hu , Mike Schaekermann , Aishwarya Kamath , Yong Cheng , David G.T. Barrett , Cathy Cheung , Basil Mustafa , Anil Palepu , Daniel McDuff , Le Hou , Tomer Golany , Luyang Liu , Jean-baptiste Alayrac , Neil Houlsby , Nenad Tomasev , Jan Freyberg , Charles Lau , Jonas Kemp , Jeremy Lai , Shekoofeh Azizi , Kimberly Kanada , SiWai Man , Kavita Kulkarni , Ruoxi Sun , Siamak Shakeri , Luheng He , Ben Caine , Albert Webson , Natasha Latysheva , Melvin Johnson , Philip Mansfield , Jian Lu , Ehud Rivlin , Jesper Anderson , Bradley Green , Renee Wong , Jonathan Krause , Jonathon Shlens , Ewa Dominowska , S. M. Ali Eslami , Katherine Chou , Claire Cui , Oriol Vinyals , Koray Kavukcuoglu , James Manyika , Jeff Dean , Demis Hassabis , Yossi Matias , Dale Webster , Joelle Barral , Greg Corrado , Christopher Semturs , S. Sara Mahdavi , Juraj Gottweis , Alan Karthikesalingam , Vivek NatarajanSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain.
- [281] arXiv:2404.18533 [ pdf , ps , html , other ]
-
Title: Evaluating Concept-based Explanations of Language Models: A Study on Faithfulness and ReadabilitySubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC)
Abstract: Despite the surprisingly high intelligence exhibited by Large Language Models (LLMs), we are somehow intimidated to fully deploy them into real-life applications considering their black-box nature. Concept-based explanations arise as a promising avenue for explaining what the LLMs have learned, making them more transparent to humans. However, current evaluations for concepts tend to be heuristic and non-deterministic, e.g. case study or human evaluation, hindering the development of the field. To bridge the gap, we approach concept-based explanation evaluation via faithfulness and readability. We first introduce a formal definition of concept generalizable to diverse concept-based explanations. Based on this, we quantify faithfulness via the difference in the output upon perturbation. We then provide an automatic measure for readability, by measuring the coherence of patterns that maximally activate a concept. This measure serves as a cost-effective and reliable substitute for human evaluation. Finally, based on measurement theory, we describe a meta-evaluation method for evaluating the above measures via reliability and validity, which can be generalized to other tasks as well. Extensive experimental analysis has been conducted to validate and inform the selection of concept evaluation measures.
- [282] arXiv:2404.18638 [ pdf , ps , html , other ]
-
Title: Reinforcement Learning Problem Solving with Large Language ModelsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Large Language Models (LLMs) encapsulate an extensive amount of world knowledge, and this has enabled their application in various domains to improve the performance of a variety of Natural Language Processing (NLP) tasks. This has also facilitated a more accessible paradigm of conversation-based interactions between humans and AI systems to solve intended problems. However, one interesting avenue that shows untapped potential is the use of LLMs as Reinforcement Learning (RL) agents to enable conversational RL problem solving. Therefore, in this study, we explore the concept of formulating Markov Decision Process-based RL problems as LLM prompting tasks. We demonstrate how LLMs can be iteratively prompted to learn and optimize policies for specific RL tasks. In addition, we leverage the introduced prompting technique for episode simulation and Q-Learning, facilitated by LLMs. We then show the practicality of our approach through two detailed case studies for "Research Scientist" and "Legal Matter Intake" workflows.
- [283] arXiv:2404.18672 [ pdf , ps , html , other ]
-
Title: Graph Convolutional Networks and Graph Attention Networks for Approximating Arguments Acceptability -- Technical ReportComments: 15 pages, 2 figures. Submitted to the 10th International Conference on Computational Models of Argument (COMMA 2024)Subjects: Artificial Intelligence (cs.AI)
Abstract: Various approaches have been proposed for providing efficient computational approaches for abstract argumentation. Among them, neural networks have permitted to solve various decision problems, notably related to arguments (credulous or skeptical) acceptability. In this work, we push further this study in various ways. First, relying on the state-of-the-art approach AFGCN, we show how we can improve the performances of the Graph Convolutional Networks (GCNs) regarding both runtime and accuracy. Then, we show that it is possible to improve even more the efficiency of the approach by modifying the architecture of the network, using Graph Attention Networks (GATs) instead.
- [284] arXiv:2404.18766 [ pdf , ps , html , other ]
-
Title: PECC: Problem Extraction and Coding ChallengesComments: This paper got accepted at LREC-COLING 2024 (long)Subjects: Artificial Intelligence (cs.AI)
Abstract: Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving and reasoning. Existing benchmarks evaluate tasks in isolation, yet the extent to which LLMs can understand prose-style tasks, identify the underlying problems, and then generate appropriate code solutions is still unexplored. Addressing this gap, we introduce PECC, a novel benchmark derived from Advent Of Code (AoC) challenges and Project Euler, including 2396 problems. Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate executable code. A key feature of our dataset is the complexity added by natural language prompting in chat-based evaluations, mirroring real-world instruction ambiguities. Results show varying model performance between narrative and neutral problems, with specific challenges in the Euler math-based subset with GPT-3.5-Turbo passing 50% of the AoC challenges and only 8% on the Euler problems. By probing the limits of LLMs' capabilities, our benchmark provides a framework to monitor and assess the subsequent progress of LLMs as a universal problem solver.
- [285] arXiv:2404.18982 [ pdf , ps , other ]
-
Title: Can ChatGPT Make Explanatory Inferences? Benchmarks for Abductive ReasoningSubjects: Artificial Intelligence (cs.AI)
Abstract: Explanatory inference is the creation and evaluation of hypotheses that provide explanations, and is sometimes known as abduction or abductive inference. Generative AI is a new set of artificial intelligence models based on novel algorithms for generating text, images, and sounds. This paper proposes a set of benchmarks for assessing the ability of AI programs to perform explanatory inference, and uses them to determine the extent to which ChatGPT, a leading generative AI model, is capable of making explanatory inferences. Tests on the benchmarks reveal that ChatGPT performs creative and evaluative inferences in many domains, although it is limited to verbal and visual modalities. Claims that ChatGPT and similar models are incapable of explanation, understanding, causal reasoning, meaning, and creativity are rebutted.
- [286] arXiv:2404.19065 [ pdf , ps , other ]
-
Title: HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language ModelsComments: Videos and code this https URLSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Recent research on instructable agents has used memory-augmented Large Language Models (LLMs) as task planners, a technique that retrieves language-program examples relevant to the input instruction and uses them as in-context examples in the LLM prompt to improve the performance of the LLM in inferring the correct action and task plans. In this technical report, we extend the capabilities of HELPER, by expanding its memory with a wider array of examples and prompts, and by integrating additional APIs for asking questions. This simple expansion of HELPER into a shared memory enables the agent to work across the domains of executing plans from dialogue, natural language instruction following, active question asking, and commonsense room reorganization. We evaluate the agent on four diverse interactive visual-language embodied agent benchmarks: ALFRED, TEACh, DialFRED, and the Tidy Task. HELPER-X achieves few-shot, state-of-the-art performance across these benchmarks using a single agent, without requiring in-domain training, and remains competitive with agents that have undergone in-domain training.
- [287] arXiv:2404.19146 [ pdf , ps , html , other ]
-
Title: Automated Construction of Theme-specific Knowledge GraphsSubjects: Artificial Intelligence (cs.AI) ; Information Retrieval (cs.IR)
Abstract: Despite widespread applications of knowledge graphs (KGs) in various tasks such as question answering and intelligent conversational systems, existing KGs face two major challenges: information granularity and deficiency in timeliness. These hinder considerably the retrieval and analysis of in-context, fine-grained, and up-to-date knowledge from KGs, particularly in highly specialized themes (e.g., specialized scientific research) and rapidly evolving contexts (e.g., breaking news or disaster tracking). To tackle such challenges, we propose a theme-specific knowledge graph (i.e., ThemeKG), a KG constructed from a theme-specific corpus, and design an unsupervised framework for ThemeKG construction (named TKGCon). The framework takes raw theme-specific corpus and generates a high-quality KG that includes salient entities and relations under the theme. Specifically, we start with an entity ontology of the theme from Wikipedia, based on which we then generate candidate relations by Large Language Models (LLMs) to construct a relation ontology. To parse the documents from the theme corpus, we first map the extracted entity pairs to the ontology and retrieve the candidate relations. Finally, we incorporate the context and ontology to consolidate the relations for entity pairs. We observe that directly prompting GPT-4 for theme-specific KG leads to inaccurate entities (such as "two main types" as one entity in the query result) and unclear (such as "is", "has") or wrong relations (such as "have due to", "to start"). In contrast, by constructing the theme-specific KG step by step, our model outperforms GPT-4 and could consistently identify accurate entities and relations. Experimental results also show that our framework excels in evaluations compared with various KG construction baselines.
- [288] arXiv:2404.19234 [ pdf , ps , other ]
-
Title: Multi-hop Question Answering over Knowledge Graphs using Large Language ModelsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Databases (cs.DB)
Abstract: Knowledge graphs (KGs) are large datasets with specific structures representing large knowledge bases (KB) where each node represents a key entity and relations amongst them are typed edges. Natural language queries formed to extract information from a KB entail starting from specific nodes and reasoning over multiple edges of the corresponding KG to arrive at the correct set of answer nodes. Traditional approaches of question answering on KG are based on (a) semantic parsing (SP), where a logical form (e.g., S-expression, SPARQL query, etc.) is generated using node and edge embeddings and then reasoning over these representations or tuning language models to generate the final answer directly, or (b) information-retrieval based that works by extracting entities and relations sequentially. In this work, we evaluate the capability of (LLMs) to answer questions over KG that involve multiple hops. We show that depending upon the size and nature of the KG we need different approaches to extract and feed the relevant information to an LLM since every LLM comes with a fixed context window. We evaluate our approach on six KGs with and without the availability of example-specific sub-graphs and show that both the IR and SP-based methods can be adopted by LLMs resulting in an extremely competitive performance.
- [289] arXiv:2404.19256 [ pdf , ps , other ]
-
Title: Bias Mitigation via Compensation: A Reinforcement Learning PerspectiveComments: 8 pages, 5 diagramsSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: As AI increasingly integrates with human decision-making, we must carefully consider interactions between the two. In particular, current approaches focus on optimizing individual agent actions but often overlook the nuances of collective intelligence. Group dynamics might require that one agent (e.g., the AI system) compensate for biases and errors in another agent (e.g., the human), but this compensation should be carefully developed. We provide a theoretical framework for algorithmic compensation that synthesizes game theory and reinforcement learning principles to demonstrate the natural emergence of deceptive outcomes from the continuous learning dynamics of agents. We provide simulation results involving Markov Decision Processes (MDP) learning to interact. This work then underpins our ethical analysis of the conditions in which AI agents should adapt to biases and behaviors of other agents in dynamic and complex decision-making environments. Overall, our approach addresses the nuanced role of strategic deception of humans, challenging previous assumptions about its detrimental effects. We assert that compensation for others' biases can enhance coordination and ethical alignment: strategic deception, when ethically managed, can positively shape human-AI interactions.
- [290] arXiv:2404.19303 [ pdf , ps , other ]
-
Title: Data Set Terminology of Artificial Intelligence in Medicine: A Historical Review and RecommendationShannon L. Walston , Hiroshi Seki , Hirotaka Takita , Yasuhito Mitsuyama , Shingo Sato , Akifumi Hagiwara , Rintaro Ito , Shouhei Hanaoka , Yukio Miki , Daiju UedaComments: Totally 20 pages, 3 figures, 3 tablesSubjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV)
Abstract: Medicine and artificial intelligence (AI) engineering represent two distinct fields each with decades of published history. With such history comes a set of terminology that has a specific way in which it is applied. However, when two distinct fields with overlapping terminology start to collaborate, miscommunication and misunderstandings can occur. This narrative review aims to give historical context for these terms, accentuate the importance of clarity when these terms are used in medical AI contexts, and offer solutions to mitigate misunderstandings by readers from either field. Through an examination of historical documents, including articles, writing guidelines, and textbooks, this review traces the divergent evolution of terms for data sets and their impact. Initially, the discordant interpretations of the word 'validation' in medical and AI contexts are explored. Then the data sets used for AI evaluation are classified, namely random splitting, cross-validation, temporal, geographic, internal, and external sets. The accurate and standardized description of these data sets is crucial for demonstrating the robustness and generalizability of AI applications in medicine. This review clarifies existing literature to provide a comprehensive understanding of these classifications and their implications in AI evaluation. This review then identifies often misunderstood terms and proposes pragmatic solutions to mitigate terminological confusion. Among these solutions are the use of standardized terminology such as 'training set,' 'validation (or tuning) set,' and 'test set,' and explicit definition of data set splitting terminologies in each medical AI research publication. This review aspires to enhance the precision of communication in medical AI, thereby fostering more effective and transparent research methodologies in this interdisciplinary field.
- [291] arXiv:2404.19336 [ pdf , ps , other ]
-
Title: Improving LLM Classification of Logical Errors by Integrating Error Relationship into PromptsComments: Accepted in ITS 2024Subjects: Artificial Intelligence (cs.AI) ; Programming Languages (cs.PL)
Abstract: LLMs trained in the understanding of programming syntax are now providing effective assistance to developers and are being used in programming education such as in generation of coding problem examples or providing code explanations. A key aspect of programming education is understanding and dealing with error message. However, 'logical errors' in which the program operates against the programmer's intentions do not receive error messages from the compiler. In this study, building on existing research on programming errors, we first define the types of logical errors that can occur in programming in general. Based on the definition, we propose an effective approach for detecting logical errors with LLMs that makes use of relations among error types in the Chain-of-Thought and Tree-of-Thought prompts. The experimental results indicate that when such logical error descriptions in the prompt are used, the average classifition performance is about 21% higher than the ones without them. We also conducted an experiment for exploiting the relations among errors in generating a new logical error dataset using LLMs. As there is very limited dataset for logical errors such benchmark dataset can be very useful for various programming related applications. We expect that our work can assist novice programmers in identifying the causes of code errors and correct them more effectively.
- [292] arXiv:2404.19370 [ pdf , ps , html , other ]
-
Title: Numeric Reward MachinesKristina Levina , Nikolaos Pappas , Athanasios Karapantelakis , Aneta Vulgarakis Feljan , Jendrik SeippComments: ICAPS 2024; Workshop on Bridging the Gap Between AI Planning and Reinforcement LearningSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Reward machines inform reinforcement learning agents about the reward structure of the environment and often drastically speed up the learning process. However, reward machines only accept Boolean features such as robot-reached-gold. Consequently, many inherently numeric tasks cannot profit from the guidance offered by reward machines. To address this gap, we aim to extend reward machines with numeric features such as distance-to-gold. For this, we present two types of reward machines: numeric-Boolean and numeric. In a numeric-Boolean reward machine, distance-to-gold is emulated by two Boolean features distance-to-gold-decreased and robot-reached-gold. In a numeric reward machine, distance-to-gold is used directly alongside the Boolean feature robot-reached-gold. We compare our new approaches to a baseline reward machine in the Craft domain, where the numeric feature is the agent-to-target distance. We use cross-product Q-learning, Q-learning with counter-factual experiences, and the options framework for learning. Our experimental results show that our new approaches significantly outperform the baseline approach. Extending reward machines with numeric features opens up new possibilities of using reward machines in inherently numeric tasks.
- [293] arXiv:2404.19454 [ pdf , ps , html , other ]
-
Title: Optimized neural forms for solving ordinary differential equationsSubjects: Artificial Intelligence (cs.AI)
Abstract: A critical issue in approximating solutions of ordinary differential equations using neural networks is the exact satisfaction of the boundary or initial conditions. For this purpose, neural forms have been introduced, i.e., functional expressions that depend on neural networks which, by design, satisfy the prescribed conditions exactly. Expanding upon prior progress, the present work contributes in three distinct aspects. First, it presents a novel formalism for crafting optimized neural forms. Second, it outlines a method for establishing an upper bound on the absolute deviation from the exact solution. Third, it introduces a technique for converting problems with Neumann or Robin conditions into equivalent problems with parametric Dirichlet conditions. The proposed optimized neural forms were numerically tested on a set of diverse problems, encompassing first-order and second-order ordinary differential equations, as well as first-order systems. Stiff and delay differential equations were also considered. The obtained solutions were compared against solutions obtained via Runge-Kutta methods and exact solutions wherever available. The reported results and analysis verify that in addition to the exact satisfaction of the boundary or initial conditions, optimized neural forms provide closed-form solutions of superior interpolation capability and controllable overall accuracy.
- [294] arXiv:2404.19485 [ pdf , ps , html , other ]
-
Title: IID Relaxation by Logical Expressivity: A Research Agenda for Fitting Logics to Neurosymbolic RequirementsComments: 12 pages, 2 figures, submitted to NeSy 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: Neurosymbolic background knowledge and the expressivity required of its logic can break Machine Learning assumptions about data Independence and Identical Distribution. In this position paper we propose to analyze IID relaxation in a hierarchy of logics that fit different use case requirements. We discuss the benefits of exploiting known data dependencies and distribution constraints for Neurosymbolic use cases and argue that the expressivity required for this knowledge has implications for the design of underlying ML routines. This opens a new research agenda with general questions about Neurosymbolic background knowledge and the expressivity required of its logic.
- [295] arXiv:2404.19721 [ pdf , ps , other ]
-
Title: PANGeA: Procedural Artificial Narrative using Generative AI for Turn-Based Video GamesSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: This research introduces Procedural Artificial Narrative using Generative AI (PANGeA), a structured approach for leveraging large language models (LLMs), guided by a game designer's high-level criteria, to generate narrative content for turn-based role-playing video games (RPGs). Distinct from prior applications of LLMs used for video game design, PANGeA innovates by not only generating game level data (which includes, but is not limited to, setting, key items, and non-playable characters (NPCs)), but by also fostering dynamic, free-form interactions between the player and the environment that align with the procedural game narrative. The NPCs generated by PANGeA are personality-biased and express traits from the Big 5 Personality Model in their generated responses. PANGeA addresses challenges behind ingesting free-form text input, which can prompt LLM responses beyond the scope of the game narrative. A novel validation system that uses the LLM's intelligence evaluates text input and aligns generated responses with the unfolding narrative. Making these interactions possible, PANGeA is supported by a server that hosts a custom memory system that supplies context for augmenting generated responses thus aligning them with the procedural narrative. For its broad application, the server has a REST interface enabling any game engine to integrate directly with PANGeA, as well as an LLM interface adaptable with local or private LLMs. PANGeA's ability to foster dynamic narrative generation by aligning responses with the procedural narrative is demonstrated through an empirical study and ablation test of two versions of a demo game. These are, a custom, browser-based GPT and a Unity demo. As the results show, PANGeA holds potential to assist game designers in using LLMs to generate narrative-consistent content even when provided varied and unpredictable, free-form text input.
- [296] arXiv:2404.00012 (cross-list from q-fin.ST) [ pdf , ps , html , other ]
-
Title: Stress index strategy enhanced with financial news sentiment analysis for the equity marketsBaptiste Lefort , Eric Benhamou , Jean-Jacques Ohana , David Saltiel , Beatrice Guez , Thomas JacquotSubjects: Statistical Finance (q-fin.ST) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Risk Management (q-fin.RM)
Abstract: This paper introduces a new risk-on risk-off strategy for the stock market, which combines a financial stress indicator with a sentiment analysis done by ChatGPT reading and interpreting Bloomberg daily market summaries. Forecasts of market stress derived from volatility and credit spreads are enhanced when combined with the financial news sentiment derived from GPT-4. As a result, the strategy shows improved performance, evidenced by higher Sharpe ratio and reduced maximum drawdowns. The improved performance is consistent across the NASDAQ, the S&P 500 and the six major equity markets, indicating that the method generalises across equities markets.
- [297] arXiv:2404.00013 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Missing Data Imputation With Granular Semantics and AI-driven Pipeline for Bankruptcy PredictionComments: 15 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST); Applications (stat.AP)
Abstract: This work focuses on designing a pipeline for the prediction of bankruptcy. The presence of missing values, high dimensional data, and highly class-imbalance databases are the major challenges in the said task. A new method for missing data imputation with granular semantics has been introduced here. The merits of granular computing have been explored here to define this method. The missing values have been predicted using the feature semantics and reliable observations in a low-dimensional space, in the granular space. The granules are formed around every missing entry, considering a few of the highly correlated features and most reliable closest observations to preserve the relevance and reliability, the context, of the database against the missing entries. An intergranular prediction is then carried out for the imputation within those contextual granules. That is, the contextual granules enable a small relevant fraction of the huge database to be used for imputation and overcome the need to access the entire database repetitively for each missing value. This method is then implemented and tested for the prediction of bankruptcy with the Polish Bankruptcy dataset. It provides an efficient solution for big and high-dimensional datasets even with large imputation rates. Then an AI-driven pipeline for bankruptcy prediction has been designed using the proposed granular semantic-based data filling method followed by the solutions to the issues like high dimensional dataset and high class-imbalance in the dataset. The rest of the pipeline consists of feature selection with the random forest for reducing dimensionality, data balancing with SMOTE, and prediction with six different popular classifiers including deep NN. All methods defined here have been experimentally verified with suitable comparative studies and proven to be effective on all the data sets captured over the five years.
- [298] arXiv:2404.00014 (cross-list from physics.chem-ph) [ pdf , ps , other ]
-
Title: Deep Geometry Handling and Fragment-wise Molecular 3D Graph GenerationOdin Zhang , Yufei Huang , Shichen Cheng , Mengyao Yu , Xujun Zhang , Haitao Lin , Yundian Zeng , Mingyang Wang , Zhenxing Wu , Huifeng Zhao , Zaixi Zhang , Chenqing Hua , Yu Kang , Sunliang Cui , Peichen Pan , Chang-Yu Hsieh , Tingjun HouSubjects: Chemical Physics (physics.chem-ph) ; Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Abstract: Most earlier 3D structure-based molecular generation approaches follow an atom-wise paradigm, incrementally adding atoms to a partially built molecular fragment within protein pockets. These methods, while effective in designing tightly bound ligands, often overlook other essential properties such as synthesizability. The fragment-wise generation paradigm offers a promising solution. However, a common challenge across both atom-wise and fragment-wise methods lies in their limited ability to co-design plausible chemical and geometrical structures, resulting in distorted conformations. In response to this challenge, we introduce the Deep Geometry Handling protocol, a more abstract design that extends the design focus beyond the model architecture. Through a comprehensive review of existing geometry-related models and their protocols, we propose a novel hybrid strategy, culminating in the development of FragGen - a geometry-reliable, fragment-wise molecular generation method. FragGen marks a significant leap forward in the quality of generated geometry and the synthesis accessibility of molecules. The efficacy of FragGen is further validated by its successful application in designing type II kinase inhibitors at the nanomolar level.
- [299] arXiv:2404.00018 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Can AI Outperform Human Experts in Creating Social Media Creatives?Comments: 17 pages, 5 figuresSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Abstract: Artificial Intelligence has outperformed human experts in functional tasks such as chess and baduk. How about creative tasks? This paper evaluates AI's capability in the creative domain compared to human experts, which little research has been conducted so far. We propose a novel Prompt-for-Prompt to generate social media creatives via prompt augmentation by Large Language Models. We take the most popular Instagram posts (with the biggest number of like clicks) in top brands' Instagram accounts to create social media creatives. We give GPT 4 several prompt instructions with text descriptions to generate the most effective prompts for cutting-edge text-to-image generators: Midjourney, DALL E 3, and Stable Diffusion. LLM-augmented prompts can boost AI's abilities by adding objectives, engagement strategy, lighting and brand consistency for social media image creation. We conduct an extensive human evaluation experiment, and find that AI excels human experts, and Midjourney is better than the other text-to-image generators. Surprisingly, unlike conventional wisdom in the social media industry, prompt instruction including eye-catching shows much poorer performance than those including natural. Regarding the type of creatives, AI improves creatives with animals or products but less with real people. Also, AI improves creatives with short text descriptions more than with long text descriptions, because there is more room for AI to augment prompts with shorter descriptions.
- [300] arXiv:2404.00019 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Advancing Explainable Autonomous Vehicle Systems: A Comprehensive Review and Research RoadmapSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: Given the uncertainty surrounding how existing explainability methods for autonomous vehicles (AVs) meet the diverse needs of stakeholders, a thorough investigation is imperative to determine the contexts requiring explanations and suitable interaction strategies. A comprehensive review becomes crucial to assess the alignment of current approaches with the varied interests and expectations within the AV ecosystem. This study presents a review to discuss the complexities associated with explanation generation and presentation to facilitate the development of more effective and inclusive explainable AV systems. Our investigation led to categorising existing literature into three primary topics: explanatory tasks, explanatory information, and explanatory information communication. Drawing upon our insights, we have proposed a comprehensive roadmap for future research centred on (i) knowing the interlocutor, (ii) generating timely explanations, (ii) communicating human-friendly explanations, and (iv) continuous learning. Our roadmap is underpinned by principles of responsible research and innovation, emphasising the significance of diverse explanation requirements. To effectively tackle the challenges associated with implementing explainable AV systems, we have delineated various research directions, including the development of privacy-preserving data integration, ethical frameworks, real-time analytics, human-centric interaction design, and enhanced cross-disciplinary collaborations. By exploring these research directions, the study aims to guide the development and deployment of explainable AVs, informed by a holistic understanding of user needs, technological advancements, regulatory compliance, and ethical considerations, thereby ensuring safer and more trustworthy autonomous driving experiences.
- [301] arXiv:2404.00022 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Analysing and Organising Human Communications for AI Fairness-Related Decisions: Use Cases from the Public SectorMirthe Dankloff , Vanja Skoric , Giovanni Sileno , Sennay Ghebreab , Jacco Van Ossenbruggen , Emma Beauxis-AussaletSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: AI algorithms used in the public sector, e.g., for allocating social benefits or predicting fraud, often involve multiple public and private stakeholders at various phases of the algorithm's life-cycle. Communication issues between these diverse stakeholders can lead to misinterpretation and misuse of algorithms. We investigate the communication processes for AI fairness-related decisions by conducting interviews with practitioners working on algorithmic systems in the public sector. By applying qualitative coding analysis, we identify key elements of communication processes that underlie fairness-related human decisions. We analyze the division of roles, tasks, skills, and challenges perceived by stakeholders. We formalize the underlying communication issues within a conceptual framework that i. represents the communication patterns ii. outlines missing elements, such as actors who miss skills for their tasks. The framework is used for describing and analyzing key organizational issues for fairness-related decisions. Three general patterns emerge from the analysis: 1. Policy-makers, civil servants, and domain experts are less involved compared to developers throughout a system's life-cycle. This leads to developers taking on extra roles such as advisor, while they potentially miss the required skills and guidance from domain experts. 2. End-users and policy-makers often lack the technical skills to interpret a system's limitations, and rely on developer roles for making decisions concerning fairness issues. 3. Citizens are structurally absent throughout a system's life-cycle, which may lead to decisions that do not include relevant considerations from impacted stakeholders.
- [302] arXiv:2404.00026 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Ink and Individuality: Crafting a Personalised Narrative in the Age of LLMsComments: 4 Pages, 2 FiguresSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: Individuality and personalization comprise the distinctive characteristics that make each writer unique and influence their words in order to effectively engage readers while conveying authenticity. However, our growing reliance on LLM-based writing assistants risks compromising our creativity and individuality over time. We often overlook the negative impacts of this trend on our creativity and uniqueness, despite the possible consequences. This study investigates these concerns by performing a brief survey to explore different perspectives and concepts, as well as trying to understand people's viewpoints, in conjunction with past studies in the area. Addressing these issues is essential for improving human-computer interaction systems and enhancing writing assistants for personalization and individuality.
- [303] arXiv:2404.00027 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: LLMs as Writing Assistants: Exploring Perspectives on Sense of Ownership and ReasoningComments: 4 Pages, 2 FiguresSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Sense of ownership in writing confines our investment of thoughts, time, and contribution, leading to attachment to the output. However, using writing assistants introduces a mental dilemma, as some content isn't directly our creation. For instance, we tend to credit Large Language Models (LLMs) more in creative tasks, even though all tasks are equal for them. Additionally, while we may not claim complete ownership of LLM-generated content, we freely claim authorship. We conduct a short survey to examine these issues and understand underlying cognitive processes in order to gain a better knowledge of human-computer interaction in writing and improve writing aid systems.
- [304] arXiv:2404.00029 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Complementarity in Human-AI Collaboration: Concept, Sources, and EvidenceSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Artificial intelligence (AI) can improve human decision-making in various application areas. Ideally, collaboration between humans and AI should lead to complementary team performance (CTP) -- a level of performance that neither of them can attain individually. So far, however, CTP has rarely been observed, suggesting an insufficient understanding of the complementary constituents in human-AI collaboration that can contribute to CTP in decision-making. This work establishes a holistic theoretical foundation for understanding and developing human-AI complementarity. We conceptualize complementarity by introducing and formalizing the notion of complementarity potential and its realization. Moreover, we identify and outline sources that explain CTP. We illustrate our conceptualization by applying it in two empirical studies exploring two different sources of complementarity potential. In the first study, we focus on information asymmetry as a source and, in a real estate appraisal use case, demonstrate that humans can leverage unique contextual information to achieve CTP. In the second study, we focus on capability asymmetry as an alternative source, demonstrating how heterogeneous capabilities can help achieve CTP. Our work provides researchers with a theoretical foundation of complementarity in human-AI decision-making and demonstrates that leveraging sources of complementarity potential constitutes a viable pathway toward effective human-AI collaboration.
- [305] arXiv:2404.00039 (cross-list from cs.PF) [ pdf , ps , html , other ]
-
Title: MicroHD: An Accuracy-Driven Optimization of Hyperdimensional Computing Algorithms for TinyML systemsComments: Accepted as a full paper by the tinyML Research Symposium 2024Subjects: Performance (cs.PF) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
Abstract: Hyperdimensional computing (HDC) is emerging as a promising AI approach that can effectively target TinyML applications thanks to its lightweight computing and memory requirements. Previous works on HDC showed that limiting the standard 10k dimensions of the hyperdimensional space to much lower values is possible, reducing even more HDC resource requirements. Similarly, other studies demonstrated that binary values can be used as elements of the generated hypervectors, leading to significant efficiency gains at the cost of some degree of accuracy degradation. Nevertheless, current optimization attempts do not concurrently co-optimize HDC hyper-parameters, and accuracy degradation is not directly controlled, resulting in sub-optimal HDC models providing several applications with unacceptable output qualities. In this work, we propose MicroHD, a novel accuracy-driven HDC optimization approach that iteratively tunes HDC hyper-parameters, reducing memory and computing requirements while ensuring user-defined accuracy levels. The proposed method can be applied to HDC implementations using different encoding functions, demonstrates good scalability for larger HDC workloads, and achieves compression and efficiency gains up to 200x when compared to baseline implementations for accuracy degradations lower than 1%.
- [306] arXiv:2404.00042 (cross-list from math.OC) [ pdf , ps , html , other ]
-
Title: Stochastic Optimization with Constraints: A Non-asymptotic Instance-Dependent AnalysisComments: 18 pagesSubjects: Optimization and Control (math.OC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract: We consider the problem of stochastic convex optimization under convex constraints. We analyze the behavior of a natural variance reduced proximal gradient (VRPG) algorithm for this problem. Our main result is a non-asymptotic guarantee for VRPG algorithm. Contrary to minimax worst case guarantees, our result is instance-dependent in nature. This means that our guarantee captures the complexity of the loss function, the variability of the noise, and the geometry of the constraint set. We show that the non-asymptotic performance of the VRPG algorithm is governed by the scaled distance (scaled by $\sqrt{N}$) between the solutions of the given problem and that of a certain small perturbation of the given problem -- both solved under the given convex constraints; here, $N$ denotes the number of samples. Leveraging a well-established connection between local minimax lower bounds and solutions to perturbed problems, we show that as $N \rightarrow \infty$, the VRPG algorithm achieves the renowned local minimax lower bound by Hàjek and Le Cam up to universal constants and a logarithmic factor of the sample size.
- [307] arXiv:2404.00044 (cross-list from physics.chem-ph) [ pdf , ps , html , other ]
-
Title: UAlign: Pushing the Limit of Template-free Retrosynthesis Prediction with Unsupervised SMILES AlignmentSubjects: Chemical Physics (physics.chem-ph) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Abstract: Motivation: Retrosynthesis planning poses a formidable challenge in the organic chemical industry. Single-step retrosynthesis prediction, a crucial step in the planning process, has witnessed a surge in interest in recent years due to advancements in AI for science. Various deep learning-based methods have been proposed for this task in recent years, incorporating diverse levels of additional chemical knowledge dependency.
Results: This paper introduces UAlign, a template-free graph-to-sequence pipeline for retrosynthesis prediction. By combining graph neural networks and Transformers, our method can more effectively leverage the inherent graph structure of molecules. Based on the fact that the majority of molecule structures remain unchanged during a chemical reaction, we propose a simple yet effective SMILES alignment technique to facilitate the reuse of unchanged structures for reactant generation. Extensive experiments show that our method substantially outperforms state-of-the-art template-free and semi-template-based approaches. Importantly, our template-free method achieves effectiveness comparable to, or even surpasses, established powerful template-based methods.
Scientific contribution: We present a novel graph-to-sequence template-free retrosynthesis prediction pipeline that overcomes the limitations of Transformer-based methods in molecular representation learning and insufficient utilization of chemical information. We propose an unsupervised learning mechanism for establishing product-atom correspondence with reactant SMILES tokens, achieving even better results than supervised SMILES alignment methods. Extensive experiments demonstrate that UAlign significantly outperforms state-of-the-art template-free methods and rivals or surpasses template-based approaches, with up to 5\% (top-5) and 5.4\% (top-10) increased accuracy over the strongest baseline. - [308] arXiv:2404.00045 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: Policy Optimization finds Nash Equilibrium in Regularized General-Sum LQ GamesSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: In this paper, we investigate the impact of introducing relative entropy regularization on the Nash Equilibria (NE) of General-Sum $N$-agent games, revealing the fact that the NE of such games conform to linear Gaussian policies. Moreover, it delineates sufficient conditions, contingent upon the adequacy of entropy regularization, for the uniqueness of the NE within the game. As Policy Optimization serves as a foundational approach for Reinforcement Learning (RL) techniques aimed at finding the NE, in this work we prove the linear convergence of a policy optimization algorithm which (subject to the adequacy of entropy regularization) is capable of provably attaining the NE. Furthermore, in scenarios where the entropy regularization proves insufficient, we present a $\delta$-augmentation technique, which facilitates the achievement of an $\epsilon$-NE within the game.
- [309] arXiv:2404.00057 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: PerOS: Personalized Self-Adapting Operating Systems in the CloudComments: 29 pages, 3 figuresSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Operating Systems (cs.OS)
Abstract: Operating systems (OSes) are foundational to computer systems, managing hardware resources and ensuring secure environments for diverse applications. However, despite their enduring importance, the fundamental design objectives of OSes have seen minimal evolution over decades. Traditionally prioritizing aspects like speed, memory efficiency, security, and scalability, these objectives often overlook the crucial aspect of intelligence as well as personalized user experience. The lack of intelligence becomes increasingly critical amid technological revolutions, such as the remarkable advancements in machine learning (ML).
Today's personal devices, evolving into intimate companions for users, pose unique challenges for traditional OSes like Linux and iOS, especially with the emergence of specialized hardware featuring heterogeneous components. Furthermore, the rise of large language models (LLMs) in ML has introduced transformative capabilities, reshaping user interactions and software development paradigms.
While existing literature predominantly focuses on leveraging ML methods for system optimization or accelerating ML workloads, there is a significant gap in addressing personalized user experiences at the OS level. To tackle this challenge, this work proposes PerOS, a personalized OS ingrained with LLM capabilities. PerOS aims to provide tailored user experiences while safeguarding privacy and personal data through declarative interfaces, self-adaptive kernels, and secure data management in a scalable cloud-centric architecture; therein lies the main research question of this work: How can we develop intelligent, secure, and scalable OSes that deliver personalized experiences to thousands of users? - [310] arXiv:2404.00060 (cross-list from q-fin.ST) [ pdf , ps , html , other ]
-
Title: Temporal Graph Networks for Graph Anomaly Detection in Financial NetworksComments: Presented at the AAAI 2024 Workshop on AI in Finance for Social Impact ( this https URL )Subjects: Statistical Finance (q-fin.ST) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper explores the utilization of Temporal Graph Networks (TGN) for financial anomaly detection, a pressing need in the era of fintech and digitized financial transactions. We present a comprehensive framework that leverages TGN, capable of capturing dynamic changes in edges within financial networks, for fraud detection. Our study compares TGN's performance against static Graph Neural Network (GNN) baselines, as well as cutting-edge hypergraph neural network baselines using DGraph dataset for a realistic financial context. Our results demonstrate that TGN significantly outperforms other models in terms of AUC metrics. This superior performance underlines TGN's potential as an effective tool for detecting financial fraud, showcasing its ability to adapt to the dynamic and complex nature of modern financial systems. We also experimented with various graph embedding modules within the TGN framework and compared the effectiveness of each module. In conclusion, we demonstrated that, even with variations within TGN, it is possible to achieve good performance in the anomaly detection task.
- [311] arXiv:2404.00076 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: A Backdoor Approach with Inverted Labels Using Dirty Label-Flipping AttacksComments: Accept by "IEEE Access" let's take a look at our global approach to the DNN(s) model(s) deployment chain in production: Danger NLP-Speech (Trigger universal approach)Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract: Audio-based machine learning systems frequently use public or third-party data, which might be inaccurate. This exposes deep neural network (DNN) models trained on such data to potential data poisoning attacks. In this type of assault, attackers can train the DNN model using poisoned data, potentially degrading its performance. Another type of data poisoning attack that is extremely relevant to our investigation is label flipping, in which the attacker manipulates the labels for a subset of data. It has been demonstrated that these assaults may drastically reduce system performance, even for attackers with minimal abilities. In this study, we propose a backdoor attack named 'DirtyFlipping', which uses dirty label techniques, "label-on-label", to input triggers (clapping) in the selected data patterns associated with the target class, thereby enabling a stealthy backdoor.
- [312] arXiv:2404.00081 (cross-list from q-bio.BM) [ pdf , ps , html , other ]
-
Title: Molecular Generative Adversarial Network with Multi-Property OptimizationSubjects: Biomolecules (q-bio.BM) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Deep generative models, such as generative adversarial networks (GANs), have been employed for $de~novo$ molecular generation in drug discovery. Most prior studies have utilized reinforcement learning (RL) algorithms, particularly Monte Carlo tree search (MCTS), to handle the discrete nature of molecular representations in GANs. However, due to the inherent instability in training GANs and RL models, along with the high computational cost associated with MCTS sampling, MCTS RL-based GANs struggle to scale to large chemical databases. To tackle these challenges, this study introduces a novel GAN based on actor-critic RL with instant and global rewards, called InstGAN, to generate molecules at the token-level with multi-property optimization. Furthermore, maximized information entropy is leveraged to alleviate the mode collapse. The experimental results demonstrate that InstGAN outperforms other baselines, achieves comparable performance to state-of-the-art models, and efficiently generates molecules with multi-property optimization. The source code will be released upon acceptance of the paper.
- [313] arXiv:2404.00107 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Robust Ensemble Person Re-Identification via Orthogonal Fusion with Occlusion HandlingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Occlusion remains one of the major challenges in person reidentification (ReID) as a result of the diversity of poses and the variation of appearances. Developing novel architectures to improve the robustness of occlusion-aware person Re-ID requires new insights, especially on low-resolution edge cameras. We propose a deep ensemble model that harnesses both CNN and Transformer architectures to generate robust feature representations. To achieve robust Re-ID without the need to manually label occluded regions, we propose to take an ensemble learning-based approach derived from the analogy between arbitrarily shaped occluded regions and robust feature representation. Using the orthogonality principle, our developed deep CNN model makes use of masked autoencoder (MAE) and global-local feature fusion for robust person identification. Furthermore, we present a part occlusion-aware transformer capable of learning feature space that is robust to occluded regions. Experimental results are reported on several Re-ID datasets to show the effectiveness of our developed ensemble model named orthogonal fusion with occlusion handling (OFOH). Compared to competing methods, the proposed OFOH approach has achieved competent rank-1 and mAP performance.
- [314] arXiv:2404.00137 (cross-list from cs.DB) [ pdf , ps , html , other ]
-
Title: Budget-aware Query Tuning: An AutoML PerspectiveSubjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Modern database systems rely on cost-based query optimizers to come up with good execution plans for input queries. Such query optimizers rely on cost models to estimate the costs of candidate query execution plans. A cost model represents a function from a set of cost units to query execution cost, where each cost unit specifies the unit cost of executing a certain type of query processing operation (such as table scan or join). These cost units are traditionally viewed as constants, whose values only depend on the platform configuration where the database system runs on top of but are invariant for queries processed by the database system. In this paper, we challenge this classic view by thinking of these cost units as variables instead. We show that, by varying the cost-unit values one can obtain query plans that significantly outperform the default query plans returned by the query optimizer when viewing the cost units as constants. We term this cost-unit tuning process "query tuning" (QT) and show that it is similar to the well-known hyper-parameter optimization (HPO) problem in AutoML. As a result, any state-of-the-art HPO technologies can be applied to QT. We study the QT problem in the context of anytime tuning, which is desirable in practice by constraining the total time spent on QT within a given budget -- we call this problem budget-aware query tuning. We further extend our study from tuning a single query to tuning a workload with multiple queries, and we call this generalized problem budget-aware workload tuning (WT), which aims for minimizing the execution time of the entire workload. WT is more challenging as one needs to further prioritize individual query tuning within the given time budget. We propose solutions to both QT and WT and experimental evaluation using both benchmark and real workloads demonstrates the efficacy of our proposed solutions.
- [315] arXiv:2404.00139 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Security Risks Concerns of Generative AI in the IoTComments: 6 pages, 2 figuresSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: In an era where the Internet of Things (IoT) intersects increasingly with generative Artificial Intelligence (AI), this article scrutinizes the emergent security risks inherent in this integration. We explore how generative AI drives innovation in IoT and we analyze the potential for data breaches when using generative AI and the misuse of generative AI technologies in IoT ecosystems. These risks not only threaten the privacy and efficiency of IoT systems but also pose broader implications for trust and safety in AI-driven environments. The discussion in this article extends to strategic approaches for mitigating these risks, including the development of robust security protocols, the multi-layered security approaches, and the adoption of AI technological solutions. Through a comprehensive analysis, this article aims to shed light on the critical balance between embracing AI advancements and ensuring stringent security in IoT, providing insights into the future direction of these intertwined technologies.
- [316] arXiv:2404.00143 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Accelerating Search-Based Planning for Multi-Robot Manipulation by Leveraging Online-Generated ExperiencesComments: The first two authors contributed equally. Accepted to ICAPS 2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: An exciting frontier in robotic manipulation is the use of multiple arms at once. However, planning concurrent motions is a challenging task using current methods. The high-dimensional composite state space renders many well-known motion planning algorithms intractable. Recently, Multi-Agent Path-Finding (MAPF) algorithms have shown promise in discrete 2D domains, providing rigorous guarantees. However, widely used conflict-based methods in MAPF assume an efficient single-agent motion planner. This poses challenges in adapting them to manipulation cases where this assumption does not hold, due to the high dimensionality of configuration spaces and the computational bottlenecks associated with collision checking. To this end, we propose an approach for accelerating conflict-based search algorithms by leveraging their repetitive and incremental nature -- making them tractable for use in complex scenarios involving multi-arm coordination in obstacle-laden environments. We show that our method preserves completeness and bounded sub-optimality guarantees, and demonstrate its practical efficacy through a set of experiments with up to 10 robotic arms.
- [317] arXiv:2404.00166 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Uncovering Bias in Large Vision-Language Models with CounterfactualsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: With the advent of Large Language Models (LLMs) possessing increasingly impressive capabilities, a number of Large Vision-Language Models (LVLMs) have been proposed to augment LLMs with visual inputs. Such models condition generated text on both an input image and a text prompt, enabling a variety of use cases such as visual question answering and multimodal chat. While prior studies have examined the social biases contained in text generated by LLMs, this topic has been relatively unexplored in LVLMs. Examining social biases in LVLMs is particularly challenging due to the confounding contributions of bias induced by information contained across the text and visual modalities. To address this challenging problem, we conduct a large-scale study of text generated by different LVLMs under counterfactual changes to input images. Specifically, we present LVLMs with identical open-ended text prompts while conditioning on images from different counterfactual sets, where each set contains images which are largely identical in their depiction of a common subject (e.g., a doctor), but vary only in terms of intersectional social attributes (e.g., race and gender). We comprehensively evaluate the text produced by different LVLMs under this counterfactual generation setting and find that social attributes such as race, gender, and physical characteristics depicted in input images can significantly influence toxicity and the generation of competency-associated words.
- [318] arXiv:2404.00172 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Universal Bovine Identification via Depth Data and Deep Metric LearningAsheesh Sharma , Lucy Randewich , William Andrew , Sion Hannuna , Neill Campbell , Siobhan Mullan , Andrew W. Dowsey , Melvyn Smith , Mark Hansen , Tilo BurghardtComments: LaTeX, 38 pages, 14 figures, 3 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper proposes and evaluates, for the first time, a top-down (dorsal view), depth-only deep learning system for accurately identifying individual cattle and provides associated code, datasets, and training weights for immediate reproducibility. An increase in herd size skews the cow-to-human ratio at the farm and makes the manual monitoring of individuals more challenging. Therefore, real-time cattle identification is essential for the farms and a crucial step towards precision livestock farming. Underpinned by our previous work, this paper introduces a deep-metric learning method for cattle identification using depth data from an off-the-shelf 3D camera. The method relies on CNN and MLP backbones that learn well-generalised embedding spaces from the body shape to differentiate individuals -- requiring neither species-specific coat patterns nor close-up muzzle prints for operation. The network embeddings are clustered using a simple algorithm such as $k$-NN for highly accurate identification, thus eliminating the need to retrain the network for enrolling new individuals. We evaluate two backbone architectures, ResNet, as previously used to identify Holstein Friesians using RGB images, and PointNet, which is specialised to operate on 3D point clouds. We also present CowDepth2023, a new dataset containing 21,490 synchronised colour-depth image pairs of 99 cows, to evaluate the backbones. Both ResNet and PointNet architectures, which consume depth maps and point clouds, respectively, led to high accuracy that is on par with the coat pattern-based backbone.
- [319] arXiv:2404.00178 (cross-list from math.OC) [ pdf , ps , html , other ]
-
Title: Beyond Suspension: A Two-phase Methodology for Concluding Sports LeaguesComments: 32 pages, 9 figuresSubjects: Optimization and Control (math.OC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Problem definition: Professional sports leagues may be suspended due to various reasons such as the recent COVID-19 pandemic. A critical question the league must address when re-opening is how to appropriately select a subset of the remaining games to conclude the season in a shortened time frame. Academic/practical relevance: Despite the rich literature on scheduling an entire season starting from a blank slate, concluding an existing season is quite different. Our approach attempts to achieve team rankings similar to that which would have resulted had the season been played out in full. Methodology: We propose a data-driven model which exploits predictive and prescriptive analytics to produce a schedule for the remainder of the season comprised of a subset of originally-scheduled games. Our model introduces novel rankings-based objectives within a stochastic optimization model, whose parameters are first estimated using a predictive model. We introduce a deterministic equivalent reformulation along with a tailored Frank-Wolfe algorithm to efficiently solve our problem, as well as a robust counterpart based on min-max regret. Results: We present simulation-based numerical experiments from previous National Basketball Association (NBA) seasons 2004--2019, and show that our models are computationally efficient, outperform a greedy benchmark that approximates a non-rankings-based scheduling policy, and produce interpretable results. Managerial implications: Our data-driven decision-making framework may be used to produce a shortened season with 25-50\% fewer games while still producing an end-of-season ranking similar to that of the full season, had it been played.
- [320] arXiv:2404.00185 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: On Inherent Adversarial Robustness of Active Vision SystemsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Current Deep Neural Networks are vulnerable to adversarial examples, which alter their predictions by adding carefully crafted noise. Since human eyes are robust to such inputs, it is possible that the vulnerability stems from the standard way of processing inputs in one shot by processing every pixel with the same importance. In contrast, neuroscience suggests that the human vision system can differentiate salient features by (1) switching between multiple fixation points (saccades) and (2) processing the surrounding with a non-uniform external resolution (foveation). In this work, we advocate that the integration of such active vision mechanisms into current deep learning systems can offer robustness benefits. Specifically, we empirically demonstrate the inherent robustness of two active vision methods - GFNet and FALcon - under a black box threat model. By learning and inferencing based on downsampled glimpses obtained from multiple distinct fixation points within an input, we show that these active methods achieve (2-3) times greater robustness compared to a standard passive convolutional network under state-of-the-art adversarial attacks. More importantly, we provide illustrative and interpretable visualization analysis that demonstrates how performing inference from distinct fixation points makes active vision methods less vulnerable to malicious inputs.
- [321] arXiv:2404.00188 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language QueriesManit Mishra , Abderrahman Braham , Charles Marsom , Bryan Chung , Gavin Griffin , Dakshesh Sidnerlikar , Chatanya Sarin , Arjun RajaramComments: 5 pages, Submitted to International Conference on AI in CybersecuritySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Conventional processes for analyzing datasets and extracting meaningful information are often time-consuming and laborious. Previous work has identified manual, repetitive coding and data collection as major obstacles that hinder data scientists from undertaking more nuanced labor and high-level projects. To combat this, we evaluated OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS) that can extrapolate key findings, including correlations and basic information, from a given dataset. The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards, including data science code-generation based tasks involving libraries such as NumPy, Pandas, Scikit-Learn, and TensorFlow, and was broadly successful in correctly answering a given data science query related to the benchmark dataset. The LDS used various novel prompt engineering techniques to effectively answer a given question, including Chain-of-Thought reinforcement and SayCan prompt engineering. Our findings demonstrate great potential for leveraging Large Language Models for low-level, zero-shot data analysis.
- [322] arXiv:2404.00195 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Multiple-policy Evaluation via Density EstimationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In this work, we focus on the multiple-policy evaluation problem where we are given a set of $K$ target policies and the goal is to evaluate their performance (the expected total rewards) to an accuracy $\epsilon$ with probability at least $1-\delta$. We propose an algorithm named $\mathrm{CAESAR}$ to address this problem. Our approach is based on computing an approximate optimal offline sampling distribution and using the data sampled from it to perform the simultaneous estimation of the policy values. $\mathrm{CAESAR}$ consists of two phases. In the first one we produce coarse estimates of the vistation distributions of the target policies at a low order sample complexity rate that scales with $\tilde{O}(\frac{1}{\epsilon})$. In the second phase, we approximate the optimal offline sampling distribution and compute the importance weighting ratios for all target policies by minimizing a step-wise quadratic loss function inspired by the objective in DualDICE. Up to low order and logarithm terms $\mathrm{CAESAR}$ achieves a sample complexity $\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^H\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{\pi^k}(s,a))^2}{\mu^*_h(s,a)}\right)$, where $d^{\pi}$ is the visitation distribution of policy $\pi$ and $\mu^*$ is the optimal sampling distribution.
- [323] arXiv:2404.00207 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Causal Inference for Human-Language Model CollaborationComments: 9 pages (Accepted for publication at NAACL 2024 (Main Conference))Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In this paper, we examine the collaborative dynamics between humans and language models (LMs), where the interactions typically involve LMs proposing text segments and humans editing or responding to these proposals. Productive engagement with LMs in such scenarios necessitates that humans discern effective text-based interaction strategies, such as editing and response styles, from historical human-LM interactions. This objective is inherently causal, driven by the counterfactual `what-if' question: how would the outcome of collaboration change if humans employed a different text editing/refinement strategy? A key challenge in answering this causal inference question is formulating an appropriate causal estimand: the conventional average treatment effect (ATE) estimand is inapplicable to text-based treatments due to their high dimensionality. To address this concern, we introduce a new causal estimand -- Incremental Stylistic Effect (ISE) -- which characterizes the average impact of infinitesimally shifting a text towards a specific style, such as increasing formality. We establish the conditions for the non-parametric identification of ISE. Building on this, we develop CausalCollab, an algorithm designed to estimate the ISE of various interaction strategies in dynamic human-LM collaborations. Our empirical investigations across three distinct human-LM collaboration scenarios reveal that CausalCollab effectively reduces confounding and significantly improves counterfactual estimation over a set of competitive baselines.
- [324] arXiv:2404.00216 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Is Factuality Decoding a Free Lunch for LLMs? Evaluation on Knowledge Editing BenchmarkSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The rapid development of large language models (LLMs) enables them to convey factual knowledge in a more human-like fashion. Extensive efforts have been made to reduce factual hallucinations by modifying LLMs with factuality decoding. However, they also pose risks of hindering knowledge updates, as they make models overly confident in known facts. In this work, we first revisite the current factuality decoding methods and verified their effectiveness in enhancing factual accuracy. Subsequently, we conduct further evaluation of several strong factuality decoding methods on the knowledge editing benchmark. All these decoding methods significantly diminish the performance of llama2 models compared to their original decoding, with the largest decrease being a staggering 81.3\%. This further indicates that the current existing decoding methods still cannot perfectly address the factual hallucinations, as they overlook the importance of preserving the flexibility for knowledge editing. Therefore, our work suggests that research into factual alignment should simultaneously focus on the effectiveness of knowledge editing.
- [325] arXiv:2404.00228 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: InfLoRA: Interference-Free Low-Rank Adaptation for Continual LearningComments: Accepted by the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Continual learning requires the model to learn multiple tasks sequentially. In continual learning, the model should possess the ability to maintain its performance on old tasks (stability) and the ability to adapt to new tasks continuously (plasticity). Recently, parameter-efficient fine-tuning (PEFT), which involves freezing a pre-trained model and injecting a small number of learnable parameters to adapt to downstream tasks, has gained increasing popularity in continual learning. Although existing continual learning methods based on PEFT have demonstrated superior performance compared to those not based on PEFT, most of them do not consider how to eliminate the interference of the new task on the old tasks, which inhibits the model from making a good trade-off between stability and plasticity. In this work, we propose a new PEFT method, called interference-free low-rank adaptation (InfLoRA), for continual learning. InfLoRA injects a small number of parameters to reparameterize the pre-trained weights and shows that fine-tuning these injected parameters is equivalent to fine-tuning the pre-trained weights within a subspace. Furthermore, InfLoRA designs this subspace to eliminate the interference of the new task on the old tasks, making a good trade-off between stability and plasticity. Experimental results show that InfLoRA outperforms existing state-of-the-art continual learning methods on multiple datasets.
- [326] arXiv:2404.00231 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Attention-based Shape-Deformation Networks for Artifact-Free Geometry Reconstruction of Lumbar Spine from MR ImagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Lumbar disc degeneration, a progressive structural wear and tear of lumbar intervertebral disc, is regarded as an essential role on low back pain, a significant global health concern. Automated lumbar spine geometry reconstruction from MR images will enable fast measurement of medical parameters to evaluate the lumbar status, in order to determine a suitable treatment. Existing image segmentation-based techniques often generate erroneous segments or unstructured point clouds, unsuitable for medical parameter measurement. In this work, we present $\textit{UNet-DeformSA}$ and $\textit{TransDeformer}$: novel attention-based deep neural networks that reconstruct the geometry of the lumbar spine with high spatial accuracy and mesh correspondence across patients, and we also present a variant of $\textit{TransDeformer}$ for error estimation. Specially, we devise new attention modules with a new attention formula, which integrate image features and tokenized contour features to predict the displacements of the points on a shape template without the need for image segmentation. The deformed template reveals the lumbar spine geometry in an image. Experiment results show that our networks generate artifact-free geometry outputs, and the variant of $\textit{TransDeformer}$ can predict the errors of a reconstructed geometry. Our code is available at this https URL .
- [327] arXiv:2404.00242 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM InferenceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Decoding using tree search can greatly enhance the inference quality for transformer-based Large Language Models (LLMs). Depending on the guidance signal, it searches for the best path from root to leaf in the tree by forming LLM outputs to improve controllability, reasoning ability, alignment, et cetera. However, current tree decoding strategies and their inference systems do not suit each other well due to redundancy in computation, memory footprints, and memory access, resulting in inefficient inference. To address this issue, we propose DeFT, an IO-aware tree attention algorithm that maintains memory-efficient attention calculation with low memory footprints in two stages: (1) QKV Preparation: we propose a KV-Guided Tree Split strategy to group QKV wisely for high utilization of GPUs and reduction of memory reads/writes for the KV cache between GPU global memory and on-chip shared memory as much as possible; (2) Attention Calculation: we calculate partial attention of each QKV groups in a fused kernel then apply a Tree-topology-aware Global Reduction strategy to get final attention. Thanks to a reduction in KV cache IO by 3.6-4.5$\times$, along with an additional reduction in IO for $\mathbf{Q} \mathbf{K}^\top$ and Softmax equivalent to 25% of the total KV cache IO, DeFT can achieve a speedup of 1.7-2.4$\times$ in end-to-end latency across two practical reasoning tasks over the SOTA attention algorithms.
- [328] arXiv:2404.00246 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Your Co-Workers Matter: Evaluating Collaborative Capabilities of Language Models in Blocks WorldSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Language agents that interact with the world on their own have great potential for automating digital tasks. While large language model (LLM) agents have made progress in understanding and executing tasks such as textual games and webpage control, many real-world tasks also require collaboration with humans or other LLMs in equal roles, which involves intent understanding, task coordination, and communication. To test LLM's ability to collaborate, we design a blocks-world environment, where two agents, each having unique goals and skills, build a target structure together. To complete the goals, they can act in the world and communicate in natural language. Under this environment, we design increasingly challenging settings to evaluate different collaboration perspectives, from independent to more complex, dependent tasks. We further adopt chain-of-thought prompts that include intermediate reasoning steps to model the partner's state and identify and correct execution errors. Both human-machine and machine-machine experiments show that LLM agents have strong grounding capacities, and our approach significantly improves the evaluation metric.
- [329] arXiv:2404.00247 (cross-list from eess.SY) [ pdf , ps , other ]
-
Title: Facilitating Reinforcement Learning for Process Control Using Transfer Learning: PerspectivesComments: Final Version of Asian Control Conference (ASCC 2024)Subjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper provides insights into deep reinforcement learning (DRL) for process control from the perspective of transfer learning. We analyze the challenges of applying DRL in the field of process industries and the necessity of introducing transfer learning. Furthermore, recommendations and prospects are provided for future research directions on how transfer learning can be integrated with DRL to empower process control.
- [330] arXiv:2404.00257 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: YOLOOC: YOLO-based Open-Class Incremental Object Detection with Novel Class DiscoveryComments: Withdrawn because it was submitted without consent of the first author. In addition, this submission has some errorsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract: Because of its use in practice, open-world object detection (OWOD) has gotten a lot of attention recently. The challenge is how can a model detect novel classes and then incrementally learn them without forgetting previously known classes. Previous approaches hinge on strongly-supervised or weakly-supervised novel-class data for novel-class detection, which may not apply to real applications. We construct a new benchmark that novel classes are only encountered at the inference stage. And we propose a new OWOD detector YOLOOC, based on the YOLO architecture yet for the Open-Class setup. We introduce label smoothing to prevent the detector from over-confidently mapping novel classes to known classes and to discover novel classes. Extensive experiments conducted on our more realistic setup demonstrate the effectiveness of our method for discovering novel classes in our new benchmark.
- [331] arXiv:2404.00261 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: A Simple Yet Effective Approach for Diversified Session-Based RecommendationSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Session-based recommender systems (SBRSs) have become extremely popular in view of the core capability of capturing short-term and dynamic user preferences. However, most SBRSs primarily maximize recommendation accuracy but ignore user minor preferences, thus leading to filter bubbles in the long run. Only a handful of works, being devoted to improving diversity, depend on unique model designs and calibrated loss functions, which cannot be easily adapted to existing accuracy-oriented SBRSs. It is thus worthwhile to come up with a simple yet effective design that can be used as a plugin to facilitate existing SBRSs on generating a more diversified list in the meantime preserving the recommendation accuracy. In this case, we propose an end-to-end framework applied for every existing representative (accuracy-oriented) SBRS, called diversified category-aware attentive SBRS (DCA-SBRS), to boost the performance on recommendation diversity. It consists of two novel designs: a model-agnostic diversity-oriented loss function, and a non-invasive category-aware attention mechanism. Extensive experiments on three datasets showcase that our framework helps existing SBRSs achieve extraordinary performance in terms of recommendation diversity and comprehensive performance, without significantly deteriorating recommendation accuracy compared to state-of-the-art accuracy-oriented SBRSs.
- [332] arXiv:2404.00271 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: TG-NAS: Leveraging Zero-Cost Proxies with Transformer and Graph Convolution Networks for Efficient Neural Architecture SearchSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Neural architecture search (NAS) is an effective method for discovering new convolutional neural network (CNN) architectures. However, existing approaches often require time-consuming training or intensive sampling and evaluations. Zero-shot NAS aims to create training-free proxies for architecture performance prediction. However, existing proxies have suboptimal performance, and are often outperformed by simple metrics such as model parameter counts or the number of floating-point operations. Besides, existing model-based proxies cannot be generalized to new search spaces with unseen new types of operators without golden accuracy truth. A universally optimal proxy remains elusive. We introduce TG-NAS, a novel model-based universal proxy that leverages a transformer-based operator embedding generator and a graph convolution network (GCN) to predict architecture performance. This approach guides neural architecture search across any given search space without the need of retraining. Distinct from other model-based predictor subroutines, TG-NAS itself acts as a zero-cost (ZC) proxy, guiding architecture search with advantages in terms of data independence, cost-effectiveness, and consistency across diverse search spaces. Our experiments showcase its advantages over existing proxies across various NAS benchmarks, suggesting its potential as a foundational element for efficient architecture search. TG-NAS achieves up to 300X improvements in search efficiency compared to previous SOTA ZC proxy methods. Notably, it discovers competitive models with 93.75% CIFAR-10 accuracy on the NAS-Bench-201 space and 74.5% ImageNet top-1 accuracy on the DARTS space.
- [333] arXiv:2404.00282 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and MethodsComments: 16 pages (including bibliography), 6 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Abstract: With extensive pre-trained knowledge and high-level general capabilities, large language models (LLMs) emerge as a promising avenue to augment reinforcement learning (RL) in aspects such as multi-task learning, sample efficiency, and task planning. In this survey, we provide a comprehensive review of the existing literature in $\textit{LLM-enhanced RL}$ and summarize its characteristics compared to conventional RL methods, aiming to clarify the research scope and directions for future studies. Utilizing the classical agent-environment interaction paradigm, we propose a structured taxonomy to systematically categorize LLMs' functionalities in RL, including four roles: information processor, reward designer, decision-maker, and generator. Additionally, for each role, we summarize the methodologies, analyze the specific RL challenges that are mitigated, and provide insights into future directions. Lastly, potential applications, prospective opportunities and challenges of the $\textit{LLM-enhanced RL}$ are discussed.
- [334] arXiv:2404.00285 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Long-Tailed Recognition on Binary Networks by Calibrating A Pre-trained ModelSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Deploying deep models in real-world scenarios entails a number of challenges, including computational efficiency and real-world (e.g., long-tailed) data distributions. We address the combined challenge of learning long-tailed distributions using highly resource-efficient binary neural networks as backbones. Specifically, we propose a calibrate-and-distill framework that uses off-the-shelf pretrained full-precision models trained on balanced datasets to use as teachers for distillation when learning binary networks on long-tailed datasets. To better generalize to various datasets, we further propose a novel adversarial balancing among the terms in the objective function and an efficient multiresolution learning scheme. We conducted the largest empirical study in the literature using 15 datasets, including newly derived long-tailed datasets from existing balanced datasets, and show that our proposed method outperforms prior art by large margins (>14.33% on average).
- [335] arXiv:2404.00306 (cross-list from cs.CE) [ pdf , ps , other ]
-
Title: Leveraging Intelligent Recommender system as a first step resilience measure -- A data-driven supply chain disruption response frameworkComments: Manuscript submitted for WSC2024 ConferenceSubjects: Computational Engineering, Finance, and Science (cs.CE) ; Artificial Intelligence (cs.AI)
Abstract: Interests in the value of digital technologies for its potential uses to increase supply chain resilience (SCRes) are increasing in light to the industry 4.0 and the global pandemic. Utilization of Recommender systems (RS) as a supply chain (SC) resilience measure is neglected although RS is a capable tool to enhance SC resilience from a reactive aspect. To address this problem, this research proposed a novel data-driven supply chain disruption response framework based on the intelligent recommender system techniques and validated the conceptual model through a practical use case. Results show that our framework can be implemented as an effective SC disruption mitigation measure in the very first response phrase and help SC participants get better reaction performance after the SC disruption.
- [336] arXiv:2404.00312 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Bayesian Exploration of Pre-trained Models for Low-shot Image ClassificationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Low-shot image classification is a fundamental task in computer vision, and the emergence of large-scale vision-language models such as CLIP has greatly advanced the forefront of research in this field. However, most existing CLIP-based methods lack the flexibility to effectively incorporate other pre-trained models that encompass knowledge distinct from CLIP. To bridge the gap, this work proposes a simple and effective probabilistic model ensemble framework based on Gaussian processes, which have previously demonstrated remarkable efficacy in processing small data. We achieve the integration of prior knowledge by specifying the mean function with CLIP and the kernel function with an ensemble of deep kernels built upon various pre-trained models. By regressing the classification label directly, our framework enables analytical inference, straightforward uncertainty quantification, and principled hyper-parameter tuning. Through extensive experiments on standard benchmarks, we demonstrate that our method consistently outperforms competitive ensemble baselines regarding predictive performance. Additionally, we assess the robustness of our method and the quality of the yielded uncertainty estimates on out-of-distribution datasets. We also illustrate that our method, despite relying on label regression, still enjoys superior model calibration compared to most deterministic baselines.
- [337] arXiv:2404.00330 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Memory-Scalable and Simplified Functional Map LearningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Deep functional maps have emerged in recent years as a prominent learning-based framework for non-rigid shape matching problems. While early methods in this domain only focused on learning in the functional domain, the latest techniques have demonstrated that by promoting consistency between functional and pointwise maps leads to significant improvements in accuracy. Unfortunately, existing approaches rely heavily on the computation of large dense matrices arising from soft pointwise maps, which compromises their efficiency and scalability. To address this limitation, we introduce a novel memory-scalable and efficient functional map learning pipeline. By leveraging the specific structure of functional maps, we offer the possibility to achieve identical results without ever storing the pointwise map in memory. Furthermore, based on the same approach, we present a differentiable map refinement layer adapted from an existing axiomatic refinement algorithm. Unlike many functional map learning methods, which use this algorithm at a post-processing step, ours can be easily used at train time, enabling to enforce consistency between the refined and initial versions of the map. Our resulting approach is both simpler, more efficient and more numerically stable, by avoiding differentiation through a linear system, while achieving close to state-of-the-art results in challenging scenarios.
- [338] arXiv:2404.00344 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Can LLMs Master Math? Investigating Large Language Models on Math Stack ExchangeAnkit Satpute , Noah Giessing , Andre Greiner-Petter , Moritz Schubotz , Olaf Teschke , Akiko Aizawa , Bela GippComments: Accepted for publication at the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) July 14--18, 2024, Washington D.C.,USASubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performances that surpass those of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopted a two-step approach for investigating the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, a case analysis is conducted on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) amongst existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1, considering P@10. Our Case analysis indicates that while the GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, thereby setting the stage for future research and advancements in AI-driven mathematical reasoning. We make our code and findings publicly available for research: \url{ this https URL }
- [339] arXiv:2404.00364 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Accurate Cutting-point Estimation for Robotic Lychee Harvesting through Geometry-aware LearningSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Accurately identifying lychee-picking points in unstructured orchard environments and obtaining their coordinate locations is critical to the success of lychee-picking robots. However, traditional two-dimensional (2D) image-based object detection methods often struggle due to the complex geometric structures of branches, leaves and fruits, leading to incorrect determination of lychee picking points. In this study, we propose a Fcaf3d-lychee network model specifically designed for the accurate localisation of lychee picking points. Point cloud data of lychee picking points in natural environments are acquired using Microsoft's Azure Kinect DK time-of-flight (TOF) camera through multi-view stitching. We augment the Fully Convolutional Anchor-Free 3D Object Detection (Fcaf3d) model with a squeeze-and-excitation(SE) module, which exploits human visual attention mechanisms for improved feature extraction of lychee picking points. The trained network model is evaluated on a test set of lychee-picking locations and achieves an impressive F1 score of 88.57%, significantly outperforming existing models. Subsequent three-dimensional (3D) position detection of picking points in real lychee orchard environments yields high accuracy, even under varying degrees of occlusion. Localisation errors of lychee picking points are within 1.5 cm in all directions, demonstrating the robustness and generality of the model.
- [340] arXiv:2404.00369 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: Worker Robot Cooperation and Integration into the Manufacturing Workcell via the Holonic Control ArchitectureSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Abstract: Worker-Robot Cooperation is a new industrial trend, which aims to sum the advantages of both the human and the industrial robot to afford a new intelligent manufacturing techniques. The cooperative manufacturing between the worker and the robot contains other elements such as the product parts and the manufacturing tools. All these production elements must cooperate in one manufacturing workcell to fulfill the production requirements. The manufacturing control system is the mean to connect all these cooperative elements together in one body. This manufacturing control system is distributed and autonomous due to the nature of the cooperative workcell. Accordingly, this article proposes the holonic control architecture as the manufacturing concept of the cooperative workcell. Furthermore, the article focuses on the feasibility of this manufacturing concept, by applying it over a case study that involves the cooperation between a dual-arm robot and a worker. During this case study, the worker uses a variety of hand gestures to cooperate with the robot to achieve the highest production flexibility
- [341] arXiv:2404.00383 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: SpikingJET: Enhancing Fault Injection for Fully and Convolutional Spiking Neural NetworksAnil Bayram Gogebakan , Enrico Magliano , Alessio Carpegna , Annachiara Ruospo , Alessandro Savino , Stefano Di CarloSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: As artificial neural networks become increasingly integrated into safety-critical systems such as autonomous vehicles, devices for medical diagnosis, and industrial automation, ensuring their reliability in the face of random hardware faults becomes paramount. This paper introduces SpikingJET, a novel fault injector designed specifically for fully connected and convolutional Spiking Neural Networks (SNNs). Our work underscores the critical need to evaluate the resilience of SNNs to hardware faults, considering their growing prominence in real-world applications. SpikingJET provides a comprehensive platform for assessing the resilience of SNNs by inducing errors and injecting faults into critical components such as synaptic weights, neuron model parameters, internal states, and activation functions. This paper demonstrates the effectiveness of Spiking-JET through extensive software-level experiments on various SNN architectures, revealing insights into their vulnerability and resilience to hardware faults. Moreover, highlighting the importance of fault resilience in SNNs contributes to the ongoing effort to enhance the reliability and safety of Neural Network (NN)-powered systems in diverse domains.
- [342] arXiv:2404.00385 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Constrained Layout Generation with Factor GraphsMohammed Haroon Dupty , Yanfei Dong , Sicong Leng , Guoji Fu , Yong Liang Goh , Wei Lu , Wee Sun LeeComments: To be published at IEEE/CVF CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper addresses the challenge of object-centric layout generation under spatial constraints, seen in multiple domains including floorplan design process. The design process typically involves specifying a set of spatial constraints that include object attributes like size and inter-object relations such as relative positioning. Existing works, which typically represent objects as single nodes, lack the granularity to accurately model complex interactions between objects. For instance, often only certain parts of an object, like a room's right wall, interact with adjacent objects. To address this gap, we introduce a factor graph based approach with four latent variable nodes for each room, and a factor node for each constraint. The factor nodes represent dependencies among the variables to which they are connected, effectively capturing constraints that are potentially of a higher order. We then develop message-passing on the bipartite graph, forming a factor graph neural network that is trained to produce a floorplan that aligns with the desired requirements. Our approach is simple and generates layouts faithful to the user requirements, demonstrated by a large improvement in IOU scores over existing methods. Additionally, our approach, being inferential and accurate, is well-suited to the practical human-in-the-loop design process where specifications evolve iteratively, offering a practical and powerful tool for AI-guided design.
- [343] arXiv:2404.00399 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive OrderTaishi Nakamura , Mayank Mishra , Simone Tedeschi , Yekun Chai , Jason T Stillerman , Felix Friedrich , Prateek Yadav , Tanmay Laud , Vu Minh Chien , Terry Yue Zhuo , Diganta Misra , Ben Bogin , Xuan-Son Vu , Marzena Karpinska , Arnav Varma Dantuluri , Wojciech Kusa , Tommaso Furlanello , Rio Yokota , Niklas Muennighoff , Suhas Pai , Tosin Adewumi , Veronika Laippala , Xiaozhe Yao , Adalberto Junior , Alpay Ariyak , Aleksandr Drozd , Jordan Clive , Kshitij Gupta , Liangyu Chen , Qi Sun , Ken Tsui , Noah Persaud , Nour Fahmy , Tianlong Chen , Mohit Bansal , Nicolo Monti , Tai Dang , Ziyang Luo , Tien-Tung Bui , Roberto Navigli , Virendra Mehta , Matthew Blumberg , Victor May , Huu Nguyen , Sampo PyysaloComments: PreprintSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, such existing models face challenges: limited multilingual capabilities, continual pretraining causing catastrophic forgetting, whereas pretraining from scratch is computationally expensive, and compliance with AI safety and development laws. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435 billion additional tokens, Aurora-M surpasses 2 trillion tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Aurora-M is rigorously evaluated across various tasks and languages, demonstrating robustness against catastrophic forgetting and outperforming alternatives in multilingual settings, particularly in safety evaluations. To promote responsible open-source LLM development, Aurora-M and its variants are released at this https URL .
- [344] arXiv:2404.00406 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: TACO -- Twitter Arguments from COnversationsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Twitter has emerged as a global hub for engaging in online conversations and as a research corpus for various disciplines that have recognized the significance of its user-generated content. Argument mining is an important analytical task for processing and understanding online discourse. Specifically, it aims to identify the structural elements of arguments, denoted as information and inference. These elements, however, are not static and may require context within the conversation they are in, yet there is a lack of data and annotation frameworks addressing this dynamic aspect on Twitter. We contribute TACO, the first dataset of Twitter Arguments utilizing 1,814 tweets covering 200 entire conversations spanning six heterogeneous topics annotated with an agreement of 0.718 Krippendorff's alpha among six experts. Second, we provide our annotation framework, incorporating definitions from the Cambridge Dictionary, to define and identify argument components on Twitter. Our transformer-based classifier achieves an 85.06\% macro F1 baseline score in detecting arguments. Moreover, our data reveals that Twitter users tend to engage in discussions involving informed inferences and information. TACO serves multiple purposes, such as training tweet classifiers to manage tweets based on inference and information elements, while also providing valuable insights into the conversational reply patterns of tweets.
- [345] arXiv:2404.00413 (cross-list from physics.space-ph) [ pdf , ps , html , other ]
-
Title: Language Models are Spacecraft OperatorsVictor Rodriguez-Fernandez , Alejandro Carrasco , Jason Cheng , Eli Scharf , Peng Mun Siew , Richard LinaresComments: Source code available on Github at: this https URLSubjects: Space Physics (physics.space-ph) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent trends are emerging in the use of Large Language Models (LLMs) as autonomous agents that take actions based on the content of the user text prompts. We intend to apply these concepts to the field of Guidance, Navigation, and Control in space, enabling LLMs to have a significant role in the decision-making process for autonomous satellite operations. As a first step towards this goal, we have developed a pure LLM-based solution for the Kerbal Space Program Differential Games (KSPDG) challenge, a public software design competition where participants create autonomous agents for maneuvering satellites involved in non-cooperative space operations, running on the KSP game engine. Our approach leverages prompt engineering, few-shot prompting, and fine-tuning techniques to create an effective LLM-based agent that ranked 2nd in the competition. To the best of our knowledge, this work pioneers the integration of LLM agents into space research. Code is available at this https URL .
- [346] arXiv:2404.00417 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-DistillationComments: CVPR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: To accommodate real-world dynamics, artificial intelligence systems need to cope with sequentially arriving content in an online manner. Beyond regular Continual Learning (CL) attempting to address catastrophic forgetting with offline training of each task, Online Continual Learning (OCL) is a more challenging yet realistic setting that performs CL in a one-pass data stream. Current OCL methods primarily rely on memory replay of old training samples. However, a notable gap from CL to OCL stems from the additional overfitting-underfitting dilemma associated with the use of rehearsal buffers: the inadequate learning of new training samples (underfitting) and the repeated learning of a few old training samples (overfitting). To this end, we introduce a novel approach, Multi-level Online Sequential Experts (MOSE), which cultivates the model as stacked sub-experts, integrating multi-level supervision and reverse self-distillation. Supervision signals across multiple stages facilitate appropriate convergence of the new task while gathering various strengths from experts by knowledge distillation mitigates the performance decline of old tasks. MOSE demonstrates remarkable efficacy in learning new samples and preserving past knowledge through multi-level experts, thereby significantly advancing OCL performance over state-of-the-art baselines (e.g., up to 7.3% on Split CIFAR-100 and 6.1% on Split Tiny-ImageNet).
- [347] arXiv:2404.00424 (cross-list from q-fin.MF) [ pdf , ps , html , other ]
-
Title: From attention to profit: quantitative trading strategy based on transformerSubjects: Mathematical Finance (q-fin.MF) ; Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Abstract: In traditional quantitative trading practice, navigating the complicated and dynamic financial market presents a persistent challenge. Former machine learning approaches have struggled to fully capture various market variables, often ignore long-term information and fail to catch up with essential signals that may lead the profit. This paper introduces an enhanced transformer architecture and designs a novel factor based on the model. By transfer learning from sentiment analysis, the proposed model not only exploits its original inherent advantages in capturing long-range dependencies and modelling complex data relationships but is also able to solve tasks with numerical inputs and accurately forecast future returns over a period. This work collects more than 5,000,000 rolling data of 4,601 stocks in the Chinese capital market from 2010 to 2019. The results of this study demonstrated the model's superior performance in predicting stock trends compared with other 100 factor-based quantitative strategies with lower turnover rates and a more robust half-life period. Notably, the model's innovative use transformer to establish factors, in conjunction with market sentiment information, has been shown to enhance the accuracy of trading signals significantly, thereby offering promising implications for the future of quantitative trading strategies.
- [348] arXiv:2404.00437 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Automatic explanation of the classification of Spanish legal judgments in jurisdiction-dependent law categories with tree estimatorsJaime González-González , Francisco de Arriba-Pérez , Silvia García-Méndez , Andrea Busto-Castiñeira , Francisco J. González-CastañoSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Automatic legal text classification systems have been proposed in the literature to address knowledge extraction from judgments and detect their aspects. However, most of these systems are black boxes even when their models are interpretable. This may raise concerns about their trustworthiness. Accordingly, this work contributes with a system combining Natural Language Processing (NLP) with Machine Learning (ML) to classify legal texts in an explainable manner. We analyze the features involved in the decision and the threshold bifurcation values of the decision paths of tree structures and present this information to the users in natural language. This is the first work on automatic analysis of legal texts combining NLP and ML along with Explainable Artificial Intelligence techniques to automatically make the models' decisions understandable to end users. Furthermore, legal experts have validated our solution, and this knowledge has also been incorporated into the explanation process as "expert-in-the-loop" dictionaries. Experimental results on an annotated data set in law categories by jurisdiction demonstrate that our system yields competitive classification performance, with accuracy values well above 90%, and that its automatic explanations are easily understandable even to non-expert users.
- [349] arXiv:2404.00438 (cross-list from cs.DC) [ pdf , ps , html , other ]
-
Title: Communication Efficient Distributed Training with Distributed LionBo Liu , Lemeng Wu , Lizhang Chen , Kaizhao Liang , Jiaxu Zhu , Chen Liang , Raghuraman Krishnamoorthi , Qiang LiuComments: 22 pagesSubjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract: The Lion optimizer has been a promising competitor with the AdamW for training large AI models, with advantages on memory, computation, and sample efficiency. In this paper, we introduce Distributed Lion, an innovative adaptation of Lion for distributed training environments. Leveraging the sign operator in Lion, our Distributed Lion only requires communicating binary or lower-precision vectors between workers to the center server, significantly reducing the communication cost. Our theoretical analysis confirms Distributed Lion's convergence properties. Empirical results demonstrate its robustness across a range of tasks, worker counts, and batch sizes, on both vision and language problems. Notably, Distributed Lion attains comparable performance to standard Lion or AdamW optimizers applied on aggregated gradients, but with significantly reduced communication bandwidth. This feature is particularly advantageous for training large models. In addition, we also demonstrate that Distributed Lion presents a more favorable performance-bandwidth balance compared to existing efficient distributed methods such as deep gradient compression and ternary gradients.
- [350] arXiv:2404.00442 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Interactive Multi-Robot Flocking with Gesture Responsiveness and Musical AccompanimentCatie Cuan , Kyle Jeffrey , Kim Kleiven , Adrian Li , Emre Fisher , Matt Harrison , Benjie Holson , Allison Okamura , Matt BenniceSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: For decades, robotics researchers have pursued various tasks for multi-robot systems, from cooperative manipulation to search and rescue. These tasks are multi-robot extensions of classical robotic tasks and often optimized on dimensions such as speed or efficiency. As robots transition from commercial and research settings into everyday environments, social task aims such as engagement or entertainment become increasingly relevant. This work presents a compelling multi-robot task, in which the main aim is to enthrall and interest. In this task, the goal is for a human to be drawn to move alongside and participate in a dynamic, expressive robot flock. Towards this aim, the research team created algorithms for robot movements and engaging interaction modes such as gestures and sound. The contributions are as follows: (1) a novel group navigation algorithm involving human and robot agents, (2) a gesture responsive algorithm for real-time, human-robot flocking interaction, (3) a weight mode characterization system for modifying flocking behavior, and (4) a method of encoding a choreographer's preferences inside a dynamic, adaptive, learned system. An experiment was performed to understand individual human behavior while interacting with the flock under three conditions: weight modes selected by a human choreographer, a learned model, or subset list. Results from the experiment showed that the perception of the experience was not influenced by the weight mode selection. This work elucidates how differing task aims such as engagement manifest in multi-robot system design and execution, and broadens the domain of multi-robot tasks.
- [351] arXiv:2404.00450 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Planning and Editing What You Retrieve for Enhanced Tool LearningComments: This paper is accepted at NAACL-Findings 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: Recent advancements in integrating external tools with Large Language Models (LLMs) have opened new frontiers, with applications in mathematical reasoning, code generators, and smart assistants. However, existing methods, relying on simple one-time retrieval strategies, fall short on effectively and accurately shortlisting relevant tools. This paper introduces a novel PLUTO (Planning, Learning, and Understanding for TOols) approach, encompassing `Plan-and-Retrieve (P&R)` and `Edit-and-Ground (E&G)` paradigms. The P&R paradigm consists of a neural retrieval module for shortlisting relevant tools and an LLM-based query planner that decomposes complex queries into actionable tasks, enhancing the effectiveness of tool utilization. The E&G paradigm utilizes LLMs to enrich tool descriptions based on user scenarios, bridging the gap between user queries and tool functionalities. Experiment results demonstrate that these paradigms significantly improve the recall and NDCG in tool retrieval tasks, significantly surpassing current state-of-the-art models.
- [352] arXiv:2404.00461 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Shortcuts Arising from Contrast: Effective and Covert Clean-Label Attacks in Prompt-Based LearningComments: 10 pages, 6 figures, conferenceSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Abstract: Prompt-based learning paradigm has demonstrated remarkable efficacy in enhancing the adaptability of pretrained language models (PLMs), particularly in few-shot scenarios. However, this learning paradigm has been shown to be vulnerable to backdoor attacks. The current clean-label attack, employing a specific prompt as a trigger, can achieve success without the need for external triggers and ensure correct labeling of poisoned samples, which is more stealthy compared to the poisoned-label attack, but on the other hand, it faces significant issues with false activations and poses greater challenges, necessitating a higher rate of poisoning. Using conventional negative data augmentation methods, we discovered that it is challenging to trade off between effectiveness and stealthiness in a clean-label setting. In addressing this issue, we are inspired by the notion that a backdoor acts as a shortcut and posit that this shortcut stems from the contrast between the trigger and the data utilized for poisoning. In this study, we propose a method named Contrastive Shortcut Injection (CSI), by leveraging activation values, integrates trigger design and data selection strategies to craft stronger shortcut features. With extensive experiments on full-shot and few-shot text classification tasks, we empirically validate CSI's high effectiveness and high stealthiness at low poisoning rates. Notably, we found that the two approaches play leading roles in full-shot and few-shot settings, respectively.
- [353] arXiv:2404.00474 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Linguistic Calibration of Language ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Abstract: Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate. This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce text with calibrated confidence statements. Through the lens of decision-making, we formalize linguistic calibration for long-form generations: an LM is linguistically calibrated if its generations enable its users to make calibrated probabilistic predictions. This definition enables a training framework where a supervised finetuning step bootstraps an LM to emit long-form generations with confidence statements such as "I estimate a 30% chance of..." or "I am certain that...", followed by a reinforcement learning step which rewards generations that enable a user to provide calibrated answers to related questions. We linguistically calibrate Llama 2 7B and find in automated and human evaluations of long-form generations that it is significantly more calibrated than strong finetuned factuality baselines with comparable accuracy. These findings generalize under distribution shift on question-answering and under a significant task shift to person biography generation. Our results demonstrate that long-form generations may be calibrated end-to-end by constructing an objective in the space of the predictions that users make in downstream decision-making.
- [354] arXiv:2404.00482 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Cross-lingual Named Entity Corpus for Slavic LanguagesComments: Published in LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and EvaluationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper presents a corpus manually annotated with named entities for six Slavic languages - Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017-2023 as a part of the Workshops on Slavic Natural Language Processing. The corpus consists of 5 017 documents on seven topics. The documents are annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits - single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture with the pre-trained multilingual models - XLM-RoBERTa-large for named entity mention recognition and categorization, and mT5-large for named entity lemmatization and linking.
- [355] arXiv:2404.00486 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMsShu Yang , Jiayuan Su , Han Jiang , Mengdi Li , Keyuan Cheng , Muhammad Asif Ali , Lijie Hu , Di WangSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: With the rise of large language models (LLMs), ensuring they embody the principles of being helpful, honest, and harmless (3H), known as Human Alignment, becomes crucial. While existing alignment methods like RLHF, DPO, etc., effectively fine-tune LLMs to match preferences in the preference dataset, they often lead LLMs to highly receptive human input and external evidence, even when this information is poisoned. This leads to a tendency for LLMs to be Adaptive Chameleons when external evidence conflicts with their parametric memory. This exacerbates the risk of LLM being attacked by external poisoned data, which poses a significant security risk to LLM system applications such as Retrieval-augmented generation (RAG). To address the challenge, we propose a novel framework: Dialectical Alignment (DA), which (1) utilizes AI feedback to identify optimal strategies for LLMs to navigate inter-context conflicts and context-memory conflicts with different external evidence in context window (i.e., different ratios of poisoned factual contexts); (2) constructs the SFT dataset as well as the preference dataset based on the AI feedback and strategies above; (3) uses the above datasets for LLM alignment to defense poisoned context attack while preserving the effectiveness of in-context knowledge editing. Our experiments show that the dialectical alignment model improves poisoned data attack defense by 20 and does not require any additional prompt engineering or prior declaration of ``you may be attacked`` to the LLMs' context window.
- [356] arXiv:2404.00487 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Contextual AI Journaling: Integrating LLM and Time Series Behavioral Sensing Technology to Promote Self-Reflection and Well-being using the MindScape AppSubigya Nepal , Arvind Pillai , William Campbell , Talie Massachi , Eunsol Soul Choi , Orson Xu , Joanna Kuc , Jeremy Huckins , Jason Holden , Colin Depp , Nicholas Jacobson , Mary Czerwinski , Eric Granholm , Andrew T. CampbellSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: MindScape aims to study the benefits of integrating time series behavioral patterns (e.g., conversational engagement, sleep, location) with Large Language Models (LLMs) to create a new form of contextual AI journaling, promoting self-reflection and well-being. We argue that integrating behavioral sensing in LLMs will likely lead to a new frontier in AI. In this Late-Breaking Work paper, we discuss the MindScape contextual journal App design that uses LLMs and behavioral sensing to generate contextual and personalized journaling prompts crafted to encourage self-reflection and emotional development. We also discuss the MindScape study of college students based on a preliminary user study and our upcoming study to assess the effectiveness of contextual AI journaling in promoting better well-being on college campuses. MindScape represents a new application class that embeds behavioral intelligence in AI.
- [357] arXiv:2404.00488 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Noise-Aware Training of Layout-Aware Language ModelsRitesh Sarkhel , Xiaoqi Ren , Lauro Beltrao Costa , Guolong Su , Vincent Perot , Yanan Xie , Emmanouil Koukoumidis , Arnab NandiSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: A visually rich document (VRD) utilizes visual features along with linguistic cues to disseminate information. Training a custom extractor that identifies named entities from a document requires a large number of instances of the target document type annotated at textual and visual modalities. This is an expensive bottleneck in enterprise scenarios, where we want to train custom extractors for thousands of different document types in a scalable way. Pre-training an extractor model on unlabeled instances of the target document type, followed by a fine-tuning step on human-labeled instances does not work in these scenarios, as it surpasses the maximum allowable training time allocated for the extractor. We address this scenario by proposing a Noise-Aware Training method or NAT in this paper. Instead of acquiring expensive human-labeled documents, NAT utilizes weakly labeled documents to train an extractor in a scalable way. To avoid degradation in the model's quality due to noisy, weakly labeled samples, NAT estimates the confidence of each training sample and incorporates it as uncertainty measure during training. We train multiple state-of-the-art extractor models using NAT. Experiments on a number of publicly available and in-house datasets show that NAT-trained models are not only robust in performance -- it outperforms a transfer-learning baseline by up to 6% in terms of macro-F1 score, but it is also more label-efficient -- it reduces the amount of human-effort required to obtain comparable performance by up to 73%.
- [358] arXiv:2404.00489 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: PROMPT-SAW: Leveraging Relation-Aware Graphs for Textual Prompt CompressionMuhammad Asif Ali , Zhengping Li , Shu Yang , Keyuan Cheng , Yang Cao , Tianhao Huang , Lijie Hu , Lu Yu , Di WangSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) have shown exceptional abilities for multiple different natural language processing tasks. While prompting is a crucial tool for LLM inference, we observe that there is a significant cost associated with exceedingly lengthy prompts. Existing attempts to compress lengthy prompts lead to sub-standard results in terms of readability and interpretability of the compressed prompt, with a detrimental impact on prompt utility. To address this, we propose PROMPT-SAW: Prompt compresSion via Relation AWare graphs, an effective strategy for prompt compression over task-agnostic and task-aware prompts. PROMPT-SAW uses the prompt's textual information to build a graph, later extracts key information elements in the graph to come up with the compressed prompt. We also propose GSM8K-AUG, i.e., an extended version of the existing GSM8k benchmark for task-agnostic prompts in order to provide a comprehensive evaluation platform. Experimental evaluation using benchmark datasets shows that prompts compressed by PROMPT-SAW are not only better in terms of readability, but they also outperform the best-performing baseline models by up to 14.3 and 13.7 respectively for task-aware and task-agnostic settings while compressing the original prompt text by 33.0 and 56.7.
- [359] arXiv:2404.00492 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Multi-hop Question Answering under Temporal Knowledge EditingKeyuan Cheng , Gang Lin , Haoyang Fei , Yuxuan zhai , Lu Yu , Muhammad Asif Ali , Lijie Hu , Di WangComments: 23 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Multi-hop question answering (MQA) under knowledge editing (KE) has garnered significant attention in the era of large language models. However, existing models for MQA under KE exhibit poor performance when dealing with questions containing explicit temporal contexts. To address this limitation, we propose a novel framework, namely TEMPoral knowLEdge augmented Multi-hop Question Answering (TEMPLE-MQA). Unlike previous methods, TEMPLE-MQA first constructs a time-aware graph (TAG) to store edit knowledge in a structured manner. Then, through our proposed inference path, structural retrieval, and joint reasoning stages, TEMPLE-MQA effectively discerns temporal contexts within the question query. Experiments on benchmark datasets demonstrate that TEMPLE-MQA significantly outperforms baseline models. Additionally, we contribute a new dataset, namely TKEMQA, which serves as the inaugural benchmark tailored specifically for MQA with temporal scopes.
- [360] arXiv:2404.00495 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Configurable Safety Tuning of Language Models with Synthetic Preference DataSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: State-of-the-art language model fine-tuning techniques, such as Direct Preference Optimization (DPO), restrict user control by hard-coding predefined behaviors into the model. To address this, we propose a novel method, Configurable Safety Tuning (CST), that augments DPO using synthetic preference data to facilitate flexible safety configuration of LLMs at inference time. CST overcomes the constraints of vanilla DPO by introducing a system prompt specifying safety configurations, enabling LLM deployers to disable/enable safety preferences based on their need, just changing the system prompt. Our experimental evaluations indicate that CST successfully manages different safety configurations and retains the original functionality of LLMs, showing it is a robust method for configurable deployment. Data and models available at this https URL
- [361] arXiv:2404.00505 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Transfer Learning with Reconstruction LossComments: 16 pages, 5 figures. To appear in IEEE Transactions on Machine Learning in Communications and Networking (TMLCN)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Machine Learning (stat.ML)
Abstract: In most applications of utilizing neural networks for mathematical optimization, a dedicated model is trained for each specific optimization objective. However, in many scenarios, several distinct yet correlated objectives or tasks often need to be optimized on the same set of problem inputs. Instead of independently training a different neural network for each problem separately, it would be more efficient to exploit the correlations between these objectives and to train multiple neural network models with shared model parameters and feature representations. To achieve this, this paper first establishes the concept of common information: the shared knowledge required for solving the correlated tasks, then proposes a novel approach for model training by adding into the model an additional reconstruction stage associated with a new reconstruction loss. This loss is for reconstructing the common information starting from a selected hidden layer in the model. The proposed approach encourages the learned features to be general and transferable, and therefore can be readily used for efficient transfer learning. For numerical simulations, three applications are studied: transfer learning on classifying MNIST handwritten digits, the device-to-device wireless network power allocation, and the multiple-input-single-output network downlink beamforming and localization. Simulation results suggest that the proposed approach is highly efficient in data and model complexity, is resilient to over-fitting, and has competitive performances.
- [362] arXiv:2404.00526 (cross-list from cs.HC) [ pdf , ps , other ]
-
Title: The Emotional Impact of Game Duration: A Framework for Understanding Player Emotions in Extended Gameplay SessionsSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Video games have played a crucial role in entertainment since their development in the 1970s, becoming even more prominent during the lockdown period when people were looking for ways to entertain them. However, at that time, players were unaware of the significant impact that playtime could have on their feelings. This has made it challenging for designers and developers to create new games since they have to control the emotional impact that these games will take on players. Thus, the purpose of this study is to look at how a player's emotions are affected by the duration of the game. In order to achieve this goal, a framework for emotion detection is created. According to the experiment's results, the volunteers' general ability to express emotions increased from 20 to 60 minutes. In comparison to shorter gameplay sessions, the experiment found that extended gameplay sessions did significantly affect the player's emotions. According to the results, it was recommended that in order to lessen the potential emotional impact that playing computer and video games may have in the future, game producers should think about creating shorter, entertaining games.
- [363] arXiv:2404.00530 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference OptimizationComments: 25 pages, 14 figures, 5 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: A common technique for aligning large language models (LLMs) relies on acquiring human preferences by comparing multiple generations conditioned on a fixed context. This only leverages the pairwise comparisons when the generations are placed in an identical context. However, such conditional rankings often fail to capture the complex and multidimensional aspects of human preferences. In this work, we revisit the traditional paradigm of preference acquisition and propose a new axis that is based on eliciting preferences jointly over the instruction-response pairs. While prior preference optimizations are designed for conditional ranking protocols (e.g., DPO), our proposed preference acquisition protocol introduces DOVE, a new preference optimization objective that upweights the joint probability of the chosen instruction-response pair over the rejected instruction-response pair. Interestingly, we find that the LLM trained with joint instruction-response preference data using DOVE outperforms the LLM trained with DPO by 5.2% and 3.3% win-rate for the summarization and open-ended dialogue datasets, respectively. Our findings reveal that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs by tapping into a broader spectrum of human preference elicitation. The data and code is available at this https URL .
- [364] arXiv:2404.00540 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Embodied Active Defense: Leveraging Recurrent Feedback to Counter Adversarial PatchesComments: 27pagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The vulnerability of deep neural networks to adversarial patches has motivated numerous defense strategies for boosting model robustness. However, the prevailing defenses depend on single observation or pre-established adversary information to counter adversarial patches, often failing to be confronted with unseen or adaptive adversarial attacks and easily exhibiting unsatisfying performance in dynamic 3D environments. Inspired by active human perception and recurrent feedback mechanisms, we develop Embodied Active Defense (EAD), a proactive defensive strategy that actively contextualizes environmental information to address misaligned adversarial patches in 3D real-world settings. To achieve this, EAD develops two central recurrent sub-modules, i.e., a perception module and a policy module, to implement two critical functions of active vision. These models recurrently process a series of beliefs and observations, facilitating progressive refinement of their comprehension of the target object and enabling the development of strategic actions to counter adversarial patches in 3D environments. To optimize learning efficiency, we incorporate a differentiable approximation of environmental dynamics and deploy patches that are agnostic to the adversary strategies. Extensive experiments demonstrate that EAD substantially enhances robustness against a variety of patches within just a few steps through its action policy in safety-critical tasks (e.g., face recognition and object detection), without compromising standard accuracy. Furthermore, due to the attack-agnostic characteristic, EAD facilitates excellent generalization to unseen attacks, diminishing the averaged attack success rate by 95 percent across a range of unseen adversarial attacks.
- [365] arXiv:2404.00544 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Deep Extrinsic Manifold Representation for Vision TasksSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Non-Euclidean data is frequently encountered across different fields, yet there is limited literature that addresses the fundamental challenge of training neural networks with manifold representations as outputs. We introduce the trick named Deep Extrinsic Manifold Representation (DEMR) for visual tasks in this context. DEMR incorporates extrinsic manifold embedding into deep neural networks, which helps generate manifold representations. The DEMR approach does not directly optimize the complex geodesic loss. Instead, it focuses on optimizing the computation graph within the embedded Euclidean space, allowing for adaptability to various architectural requirements. We provide empirical evidence supporting the proposed concept on two types of manifolds, $SE(3)$ and its associated quotient manifolds. This evidence offers theoretical assurances regarding feasibility, asymptotic properties, and generalization capability. The experimental results show that DEMR effectively adapts to point cloud alignment, producing outputs in $ SE(3) $, as well as in illumination subspace learning with outputs on the Grassmann manifold.
- [366] arXiv:2404.00576 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Automated Bi-Fold Weighted Ensemble Algorithms and its Application to Brain Tumor Detection and ClassificationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The uncontrolled and unstructured growth of brain cells is known as brain tumor, which has one of the highest mortality rates among diseases from all types of cancers. Due to limited diagnostic and treatment capabilities, they pose significant challenges, especially in third-world countries. Early diagnosis plays a vital role in effectively managing brain tumors and reducing mortality rates. However, the availability of diagnostic methods is hindered by various limitations, including high costs and lengthy result acquisition times, impeding early detection of the disease. In this study, we present two cutting-edge bi-fold weighted voting ensemble models that aim to boost the effectiveness of weighted ensemble methods. These two proposed methods combine the classification outcomes from multiple classifiers and determine the optimal result by selecting the one with the highest probability in the first approach, and the highest weighted prediction in the second technique. These approaches significantly improve the overall performance of weighted ensemble techniques. In the first proposed method, we improve the soft voting technique (SVT) by introducing a novel unsupervised weight calculating schema (UWCS) to enhance its weight assigning capability, known as the extended soft voting technique (ESVT). Secondly, we propose a novel weighted method (NWM) by using the proposed UWCS. Both of our approaches incorporate three distinct models: a custom-built CNN, VGG-16, and InceptionResNetV2 which has been trained on publicly available datasets. The effectiveness of our proposed systems is evaluated through blind testing, where exceptional results are achieved. We then establish a comparative analysis of the performance of our proposed methods with that of SVT to show their superiority and effectiveness.
- [367] arXiv:2404.00579 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: A Review of Modern Recommender Systems Using Generative Models (Gen-RecSys)Yashar Deldjoo , Zhankui He , Julian McAuley , Anton Korikov , Scott Sanner , Arnau Ramisa , René Vidal , Maheswaran Sathiamoorthy , Atoosa Kasirzadeh , Silvia MilanoSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Traditional recommender systems (RS) have used user-item rating histories as their primary data source, with collaborative filtering being one of the principal methods. However, generative models have recently developed abilities to model and sample from complex data distributions, including not only user-item interaction histories but also text, images, and videos - unlocking this rich data for novel recommendation tasks. Through this comprehensive and multi-disciplinary survey, we aim to connect the key advancements in RS using Generative Models (Gen-RecSys), encompassing: a foundational overview of interaction-driven generative models; the application of large language models (LLM) for generative recommendation, retrieval, and conversational recommendation; and the integration of multimodal models for processing and generating image and video content in RS. Our holistic perspective allows us to highlight necessary paradigms for evaluating the impact and harm of Gen-RecSys and identify open challenges. A more up-to-date version of the papers is maintained at: this https URL .
- [368] arXiv:2404.00588 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Memory-based Cross-modal Semantic Alignment Network for Radiology Report GenerationComments: 12 pages, 8 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Generating radiology reports automatically reduces the workload of radiologists and helps the diagnoses of specific diseases. Many existing methods take this task as modality transfer process. However, since the key information related to disease accounts for a small proportion in both image and report, it is hard for the model to learn the latent relation between the radiology image and its report, thus failing to generate fluent and accurate radiology reports. To tackle this problem, we propose a memory-based cross-modal semantic alignment model (MCSAM) following an encoder-decoder paradigm. MCSAM includes a well initialized long-term clinical memory bank to learn disease-related representations as well as prior knowledge for different modalities to retrieve and use the retrieved memory to perform feature consolidation. To ensure the semantic consistency of the retrieved cross modal prior knowledge, a cross-modal semantic alignment module (SAM) is proposed. SAM is also able to generate semantic visual feature embeddings which can be added to the decoder and benefits report generation. More importantly, to memorize the state and additional information while generating reports with the decoder, we use learnable memory tokens which can be seen as prompts. Extensive experiments demonstrate the promising performance of our proposed method which generates state-of-the-art performance on the MIMIC-CXR dataset.
- [369] arXiv:2404.00593 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LAESI: Leaf Area Estimation with Synthetic ImageryJacek Kałużny , Yannik Schreckenberg , Karol Cyganik , Peter Annighöfer , Sören Pirk , Dominik L. Michels , Mikolaj Cieslak , Farhah Assaad-Gerbert , Bedrich Benes , Wojciech PałubickiComments: 10 pages, 12 figures, 1 tableSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Abstract: We introduce LAESI, a Synthetic Leaf Dataset of 100,000 synthetic leaf images on millimeter paper, each with semantic masks and surface area labels. This dataset provides a resource for leaf morphology analysis primarily aimed at beech and oak leaves. We evaluate the applicability of the dataset by training machine learning models for leaf surface area prediction and semantic segmentation, using real images for validation. Our validation shows that these models can be trained to predict leaf surface area with a relative error not greater than an average human annotator. LAESI also provides an efficient framework based on 3D procedural models and generative AI for the large-scale, controllable generation of data with potential further applications in agriculture and biology. We evaluate the inclusion of generative AI in our procedural data generation pipeline and show how data filtering based on annotation consistency results in datasets which allow training the highest performing vision models.
- [370] arXiv:2404.00599 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code RepositoriesComments: Data: this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Abstract: How to evaluate Large Language Models (LLMs) in code generation is an open question. Existing benchmarks demonstrate poor alignment with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. This paper proposes a new benchmark - EvoCodeBench to address the preceding problems, which has three primary advances. (1) EvoCodeBench aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) EvoCodeBench offers comprehensive annotations (e.g., requirements, reference code, and reference dependencies), and robust evaluation metrics (e.g., Pass@k and Recall@k). (3) EvoCodeBench is an evolving benchmark to avoid data leakage. We build an automatic pipeline to update EvoCodeBench from the latest repositories. We release the first version - EvoCodeBench-2403, containing 275 samples from 25 real-world repositories. Based on EvoCodeBench, we propose repository-level code generation and evaluate 10 popular LLMs (e.g., gpt-4, gpt-3.5, DeepSeek Coder, StarCoder 2, CodeLLaMa, Gemma, and Qwen 1.5). Our experiments reveal the coding abilities of these LLMs in real-world repositories. For example, the highest Pass@1 of gpt-4 only is 20.73% in our experiments. We also analyze failed cases and summarize the shortcomings of existing LLMs in EvoCodeBench. We release EvoCodeBench, all prompts, and LLMs' completions for further community analysis.
- [371] arXiv:2404.00600 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: AI Act and Large Language Models (LLMs): When critical issues and privacy impact require human and ethical oversightSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The imposing evolution of artificial intelligence systems and, specifically, of Large Language Models (LLM) makes it necessary to carry out assessments of their level of risk and the impact they may have in the area of privacy, personal data protection and at an ethical level, especially on the weakest and most vulnerable. This contribution addresses human oversight, ethical oversight, and privacy impact assessment.
- [372] arXiv:2404.00604 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Extensive Self-Contrast Enables Feedback-Free Language Model AlignmentSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Reinforcement learning from human feedback (RLHF) has been a central technique for recent large language model (LLM) alignment. However, its heavy dependence on costly human or LLM-as-Judge preference feedback could stymie its wider applications. In this work, we introduce Self-Contrast, a feedback-free large language model alignment method via exploiting extensive self-generated negatives. With only supervised fine-tuning (SFT) targets, Self-Contrast leverages the LLM itself to generate massive diverse candidates, and harnesses a pre-trained embedding model to filter multiple negatives according to text similarity. Theoretically, we illustrate that in this setting, merely scaling negative responses can still effectively approximate situations with more balanced positive and negative preference annotations. Our experiments with direct preference optimization (DPO) on three datasets show that, Self-Contrast could consistently outperform SFT and standard DPO training by large margins. And as the number of self-generated negatives increases, the performance of Self-Contrast continues to grow. Code and data are available at this https URL .
- [373] arXiv:2404.00614 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Learning to Plan for Language Modeling from Unlabeled DataComments: under reviewSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: By training to predict the next token in an unlabeled corpus, large language models learn to perform many tasks without any labeled data. However, their next-token-prediction objective arguably limits their performance in scenarios that require planning, such as writing a coherent article. In this paper, we train a module for planning the future writing process via a self-supervised learning objective. By conditioning on generated latent plans, our model extends the successful language model formula to more abstract planning in an unsupervised way. Empirically, we demonstrate that our method improves language modeling performance in general, particularly with respect to the text structure. Because our framework uses a planner module that is unsupervised and external to the language model, new planner modules can be trained at large scale and easily be shared with the community.
- [374] arXiv:2404.00636 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait AnimationComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: In this paper, we present Export3D, a one-shot 3D-aware portrait animation method that is able to control the facial expression and camera view of a given portrait image. To achieve this, we introduce a tri-plane generator that directly generates a tri-plane of 3D prior by transferring the expression parameter of 3DMM into the source image. The tri-plane is then decoded into the image of different view through a differentiable volume rendering. Existing portrait animation methods heavily rely on image warping to transfer the expression in the motion space, challenging on disentanglement of appearance and expression. In contrast, we propose a contrastive pre-training framework for appearance-free expression parameter, eliminating undesirable appearance swap when transferring a cross-identity expression. Extensive experiments show that our pre-training framework can learn the appearance-free expression representation hidden in 3DMM, and our model can generate 3D-aware expression controllable portrait image without appearance swap in the cross-identity manner.
- [375] arXiv:2404.00651 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Learning Off-policy with Model-based Intrinsic Motivation For Active Online ExplorationComments: PreprintSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in deep reinforcement learning (RL) have demonstrated notable progress in sample efficiency, spanning both model-based and model-free paradigms. Despite the identification and mitigation of specific bottlenecks in prior works, the agent's exploration ability remains under-emphasized in the realm of sample-efficient RL. This paper investigates how to achieve sample-efficient exploration in continuous control tasks. We introduce an RL algorithm that incorporates a predictive model and off-policy learning elements, where an online planner enhanced by a novelty-aware terminal value function is employed for sample collection. Leveraging the forward predictive error within a latent state space, we derive an intrinsic reward without incurring parameters overhead. This reward establishes a solid connection to model uncertainty, allowing the agent to effectively overcome the asymptotic performance gap. Through extensive experiments, our method shows competitive or even superior performance compared to prior works, especially the sparse reward cases.
- [376] arXiv:2404.00656 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: WavLLM: Towards Robust and Adaptive Speech Large Language ModelShujie Hu , Long Zhou , Shujie Liu , Sanyuan Chen , Hongkun Hao , Jing Pan , Xunying Liu , Jinyu Li , Sunit Sivasankaran , Linquan Liu , Furu WeiSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: The recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs poses significant challenges, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In this work, we introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach. Leveraging dual encoders, we decouple different types of speech information, utilizing a Whisper encoder to process the semantic content of speech, and a WavLM encoder to capture the unique characteristics of the speaker's identity. Within the curriculum learning framework, WavLLM first builds its foundational capabilities by optimizing on mixed elementary single tasks, followed by advanced multi-task training on more complex tasks such as combinations of the elementary tasks. To enhance the flexibility and adherence to different tasks and instructions, a prompt-aware LoRA weight adapter is introduced in the second advanced multi-task training stage. We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, ER, and also apply it to specialized datasets like Gaokao English listening comprehension set for SQA, and speech Chain-of-Thought (CoT) evaluation set. Experiments demonstrate that the proposed model achieves state-of-the-art performance across a range of speech tasks on the same model size, exhibiting robust generalization capabilities in executing complex tasks using CoT approach. Furthermore, our model successfully completes Gaokao tasks without specialized training. The codes, models, audio, and Gaokao evaluation set can be accessed at \url{ this http URL }.
- [377] arXiv:2404.00657 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Observations on Building RAG Systems for Technical DocumentsComments: Published as a Tiny Paper at ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Retrieval augmented generation (RAG) for technical documents creates challenges as embeddings do not often capture domain information. We review prior art for important factors affecting RAG and perform experiments to highlight best practices and potential challenges to build RAG systems for technical documents.
- [378] arXiv:2404.00672 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A General and Efficient Training for Transformer via Token ExpansionWenxuan Huang , Yunhang Shen , Jiao Xie , Baochang Zhang , Gaoqi He , Ke Li , Xing Sun , Shaohui LinComments: Accepted to CVPR 2024. Code is available at this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The remarkable performance of Vision Transformers (ViTs) typically requires an extremely large training cost. Existing methods have attempted to accelerate the training of ViTs, yet typically disregard method universality with accuracy dropping. Meanwhile, they break the training consistency of the original transformers, including the consistency of hyper-parameters, architecture, and strategy, which prevents them from being widely applied to different Transformer networks. In this paper, we propose a novel token growth scheme Token Expansion (termed ToE) to achieve consistent training acceleration for ViTs. We introduce an "initialization-expansion-merging" pipeline to maintain the integrity of the intermediate feature distribution of original transformers, preventing the loss of crucial learnable information in the training process. ToE can not only be seamlessly integrated into the training and fine-tuning process of transformers (e.g., DeiT and LV-ViT), but also effective for efficient training frameworks (e.g., EfficientTrain), without twisting the original training hyper-parameters, architecture, and introducing additional training strategies. Extensive experiments demonstrate that ToE achieves about 1.3x faster for the training of ViTs in a lossless manner, or even with performance gains over the full-token training baselines. Code is available at this https URL .
- [379] arXiv:2404.00673 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: A Survey of Privacy-Preserving Model Explanations: Privacy Risks, Attacks, and CountermeasuresThanh Tam Nguyen , Thanh Trung Huynh , Zhao Ren , Thanh Toan Nguyen , Phi Le Nguyen , Hongzhi Yin , Quoc Viet Hung NguyenSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: As the adoption of explainable AI (XAI) continues to expand, the urgency to address its privacy implications intensifies. Despite a growing corpus of research in AI privacy and explainability, there is little attention on privacy-preserving model explanations. This article presents the first thorough survey about privacy attacks on model explanations and their countermeasures. Our contribution to this field comprises a thorough analysis of research papers with a connected taxonomy that facilitates the categorisation of privacy attacks and countermeasures based on the targeted explanations. This work also includes an initial investigation into the causes of privacy leaks. Finally, we discuss unresolved issues and prospective research directions uncovered in our analysis. This survey aims to be a valuable resource for the research community and offers clear insights for those new to this domain. To support ongoing research, we have established an online resource repository, which will be continuously updated with new and relevant findings. Interested readers are encouraged to access our repository at this https URL .
- [380] arXiv:2404.00675 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LLM meets Vision-Language Models for Zero-Shot One-Class ClassificationYassir Bendou , Giulia Lioi , Bastien Pasdeloup , Lukas Mauch , Ghouthi Boukli Hacene , Fabien Cardinaux , Vincent GriponSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We consider the problem of zero-shot one-class visual classification. In this setting, only the label of the target class is available, and the goal is to discriminate between positive and negative query samples without requiring any validation example from the target task. We propose a two-step solution that first queries large language models for visually confusing objects and then relies on vision-language pre-trained models (e.g., CLIP) to perform classification. By adapting large-scale vision benchmarks, we demonstrate the ability of the proposed method to outperform adapted off-the-shelf alternatives in this setting. Namely, we propose a realistic benchmark where negative query samples are drawn from the same original dataset as positive ones, including a granularity-controlled version of iNaturalist, where negative samples are at a fixed distance in the taxonomy tree from the positive ones. Our work shows that it is possible to discriminate between a single category and other semantically related ones using only its label
- [381] arXiv:2404.00684 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Generative Retrieval as Multi-Vector Dense RetrievalShiguang Wu , Wenda Wei , Mengqi Zhang , Zhumin Chen , Jun Ma , Zhaochun Ren , Maarten de Rijke , Pengjie RenComments: 12 pages, 5 figures, 8 tables, accepted at SIGIR 2024Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Generative retrieval generates identifiers of relevant documents in an end-to-end manner using a sequence-to-sequence architecture for a given query. The relation between generative retrieval and other retrieval methods, especially those based on matching within dense retrieval models, is not yet fully comprehended. Prior work has demonstrated that generative retrieval with atomic identifiers is equivalent to single-vector dense retrieval. Accordingly, generative retrieval exhibits behavior analogous to hierarchical search within a tree index in dense retrieval when using hierarchical semantic identifiers. However, prior work focuses solely on the retrieval stage without considering the deep interactions within the decoder of generative retrieval.
In this paper, we fill this gap by demonstrating that generative retrieval and multi-vector dense retrieval share the same framework for measuring the relevance to a query of a document. Specifically, we examine the attention layer and prediction head of generative retrieval, revealing that generative retrieval can be understood as a special case of multi-vector dense retrieval. Both methods compute relevance as a sum of products of query and document vectors and an alignment matrix. We then explore how generative retrieval applies this framework, employing distinct strategies for computing document token vectors and the alignment matrix. We have conducted experiments to verify our conclusions and show that both paradigms exhibit commonalities of term matching in their alignment matrix. - [382] arXiv:2404.00685 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: Scaling Properties of Speech Language ModelsSubjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Abstract: Speech Language Models (SLMs) aim to learn language from raw audio, without textual resources. Despite significant advances, our current models exhibit weak syntax and semantic abilities. However, if the scaling properties of neural language models hold for the speech modality, these abilities will improve as the amount of compute used for training increases. In this paper, we use models of this scaling behavior to estimate the scale at which our current methods will yield a SLM with the English proficiency of text-based Large Language Models (LLMs). We establish a strong correlation between pre-training loss and downstream syntactic and semantic performance in SLMs and LLMs, which results in predictable scaling of linguistic performance. We show that the linguistic performance of SLMs scales up to three orders of magnitude more slowly than that of text-based LLMs. Additionally, we study the benefits of synthetic data designed to boost semantic understanding and the effects of coarser speech tokenization.
- [383] arXiv:2404.00712 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Survey of Computerized Adaptive Testing: A Machine Learning PerspectiveQi Liu , Yan Zhuang , Haoyang Bi , Zhenya Huang , Weizhe Huang , Jiatong Li , Junhao Yu , Zirui Liu , Zirui Hu , Yuting Hong , Zachary A. Pardos , Haiping Ma , Mengxiao Zhu , Shijin Wang , Enhong ChenSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Abstract: Computerized Adaptive Testing (CAT) provides an efficient and tailored method for assessing the proficiency of examinees, by dynamically adjusting test questions based on their performance. Widely adopted across diverse fields like education, healthcare, sports, and sociology, CAT has revolutionized testing practices. While traditional methods rely on psychometrics and statistics, the increasing complexity of large-scale testing has spurred the integration of machine learning techniques. This paper aims to provide a machine learning-focused survey on CAT, presenting a fresh perspective on this adaptive testing method. By examining the test question selection algorithm at the heart of CAT's adaptivity, we shed light on its functionality. Furthermore, we delve into cognitive diagnosis models, question bank construction, and test control within CAT, exploring how machine learning can optimize these components. Through an analysis of current methods, strengths, limitations, and challenges, we strive to develop robust, fair, and efficient CAT systems. By bridging psychometric-driven CAT research with machine learning, this survey advocates for a more inclusive and interdisciplinary approach to the future of adaptive testing.
- [384] arXiv:2404.00722 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DRCT: Saving Image Super-resolution away from Information BottleneckComments: Camera-ready version, NTIRE 2024 Image Super-resolution (x4)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In recent years, Vision Transformer-based approaches for low-level vision tasks have achieved widespread success. Unlike CNN-based models, Transformers are more adept at capturing long-range dependencies, enabling the reconstruction of images utilizing non-local information. In the domain of super-resolution, Swin-transformer-based models have become mainstream due to their capability of global spatial information modeling and their shifting-window attention mechanism that facilitates the interchange of information between different windows. Many researchers have enhanced model performance by expanding the receptive fields or designing meticulous networks, yielding commendable results. However, we observed that it is a general phenomenon for the feature map intensity to be abruptly suppressed to small values towards the network's end. This implies an information bottleneck and a diminishment of spatial information, implicitly limiting the model's potential. To address this, we propose the Dense-residual-connected Transformer (DRCT), aimed at mitigating the loss of spatial information and stabilizing the information flow through dense-residual connections between layers, thereby unleashing the model's potential and saving the model away from information bottleneck. Experiment results indicate that our approach surpasses state-of-the-art methods on benchmark datasets and performs commendably at the NTIRE-2024 Image Super-Resolution (x4) Challenge. Our source code is available at this https URL
- [385] arXiv:2404.00725 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: The Larger the Better? Improved LLM Code-Generation via Budget ReallocationSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: It is a common belief that large language models (LLMs) are better than smaller-sized ones. However, larger models also require significantly more time and compute during inference. This begs the question: what happens when both models operate under the same budget? (e.g., compute, run-time). To address this question, we analyze code generation LLMs of various sizes and make comparisons such as running a 70B model once vs. generating five outputs from a 13B model and selecting one. Our findings reveal that, in a standard unit-test setup, the repeated use of smaller models can yield consistent improvements, with gains of up to 15% across five tasks. On the other hand, in scenarios where unit-tests are unavailable, a ranking-based selection of candidates from the smaller model falls short of the performance of a single output from larger ones. Our results highlight the potential of using smaller models instead of larger ones, and the importance of studying approaches for ranking LLM outputs.
- [386] arXiv:2404.00746 (cross-list from cs.DB) [ pdf , ps , html , other ]
-
Title: Mining Weighted Sequential Patterns in Incremental Uncertain DatabasesKashob Kumar Roy , Md Hasibul Haque Moon , Md Mahmudur Rahman , Chowdhury Farhan Ahmed , Carson Kai-Sang LeungComments: Accepted to Information Science journalJournal-ref: Information Sciences 582 (2022): 865-896Subjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI)
Abstract: Due to the rapid development of science and technology, the importance of imprecise, noisy, and uncertain data is increasing at an exponential rate. Thus, mining patterns in uncertain databases have drawn the attention of researchers. Moreover, frequent sequences of items from these databases need to be discovered for meaningful knowledge with great impact. In many real cases, weights of items and patterns are introduced to find interesting sequences as a measure of importance. Hence, a constraint of weight needs to be handled while mining sequential patterns. Besides, due to the dynamic nature of databases, mining important information has become more challenging. Instead of mining patterns from scratch after each increment, incremental mining algorithms utilize previously mined information to update the result immediately. Several algorithms exist to mine frequent patterns and weighted sequences from incremental databases. However, these algorithms are confined to mine the precise ones. Therefore, we have developed an algorithm to mine frequent sequences in an uncertain database in this work. Furthermore, we have proposed two new techniques for mining when the database is incremental. Extensive experiments have been conducted for performance evaluation. The analysis showed the efficiency of our proposed framework.
- [387] arXiv:2404.00748 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Benchmark Transparency: Measuring the Impact of Data on EvaluationComments: Accepted at NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In this paper we present an exploratory research on quantifying the impact that data distribution has on the performance and evaluation of NLP models. We propose an automated framework that measures the data point distribution across 6 different dimensions: ambiguity, difficulty, discriminability, length, noise, and perplexity.
We use disproportional stratified sampling to measure how much the data distribution affects absolute (Acc/F1) and relative (Rank) model performance. We experiment on 2 different datasets (SQUAD and MNLI) and test a total of 135 different models (125 on SQUAD and 10 on MNLI). We demonstrate that without explicit control of the data distribution, standard evaluation frameworks are inconsistent and unreliable. We find that the impact of the data is statistically significant and is often larger than the impact of changing the metric.
In a second set of experiments, we demonstrate that the impact of data on evaluation is not just observable, but also predictable. We propose to use benchmark transparency as a method for comparing datasets and quantifying the similarity between them. We find that the ``dataset similarity vector'' can be used to predict how well a model generalizes out of distribution. - [388] arXiv:2404.00752 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: On the True Distribution Approximation of Minimum Bayes-Risk DecodingComments: NAACL 2024 (main conference)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Minimum Bayes-risk (MBR) decoding has recently gained renewed attention in text generation. MBR decoding considers texts sampled from a model as pseudo-references and selects the text with the highest similarity to the others. Therefore, sampling is one of the key elements of MBR decoding, and previous studies reported that the performance varies by sampling methods. From a theoretical standpoint, this performance variation is likely tied to how closely the samples approximate the true distribution of references. However, this approximation has not been the subject of in-depth study. In this study, we propose using anomaly detection to measure the degree of approximation. We first closely examine the performance variation and then show that previous hypotheses about samples do not correlate well with the variation, but our introduced anomaly scores do. The results are the first to empirically support the link between the performance and the core assumption of MBR decoding.
- [389] arXiv:2404.00777 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Privacy-preserving Optics for Enhancing Protection in Face De-identificationComments: Accepted to CVPR 2024. Project Website and Code coming soonSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract: The modern surge in camera usage alongside widespread computer vision technology applications poses significant privacy and security concerns. Current artificial intelligence (AI) technologies aid in recognizing relevant events and assisting in daily tasks in homes, offices, hospitals, etc. The need to access or process personal information for these purposes raises privacy concerns. While software-level solutions like face de-identification provide a good privacy/utility trade-off, they present vulnerabilities to sniffing attacks. In this paper, we propose a hardware-level face de-identification method to solve this vulnerability. Specifically, our approach first learns an optical encoder along with a regression model to obtain a face heatmap while hiding the face identity from the source image. We also propose an anonymization framework that generates a new face using the privacy-preserving image, face heatmap, and a reference face image from a public dataset as input. We validate our approach with extensive simulations and hardware experiments.
- [390] arXiv:2404.00781 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Addressing Loss of Plasticity and Catastrophic Forgetting in Continual LearningComments: Published in the Proceedings of the 12th International Conference on Learning Representations (ICLR 2024). Code is available at this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Deep representation learning methods struggle with continual learning, suffering from both catastrophic forgetting of useful units and loss of plasticity, often due to rigid and unuseful units. While many methods address these two issues separately, only a few currently deal with both simultaneously. In this paper, we introduce Utility-based Perturbed Gradient Descent (UPGD) as a novel approach for the continual learning of representations. UPGD combines gradient updates with perturbations, where it applies smaller modifications to more useful units, protecting them from forgetting, and larger modifications to less useful units, rejuvenating their plasticity. We use a challenging streaming learning setup where continual learning problems have hundreds of non-stationarities and unknown task boundaries. We show that many existing methods suffer from at least one of the issues, predominantly manifested by their decreasing accuracy over tasks. On the other hand, UPGD continues to improve performance and surpasses or is competitive with all methods in all problems. Finally, in extended reinforcement learning experiments with PPO, we show that while Adam exhibits a performance drop after initial learning, UPGD avoids it by addressing both continual learning issues.
- [391] arXiv:2404.00806 (cross-list from econ.GN) [ pdf , ps , html , other ]
-
Title: Algorithmic Collusion by Large Language ModelsSubjects: General Economics (econ.GN) ; Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Abstract: The rise of algorithmic pricing raises concerns of algorithmic collusion. We conduct experiments with algorithmic pricing agents based on Large Language Models (LLMs), and specifically GPT-4. We find that (1) LLM-based agents are adept at pricing tasks, (2) LLM-based pricing agents autonomously collude in oligopoly settings to the detriment of consumers, and (3) variation in seemingly innocuous phrases in LLM instructions ("prompts") may increase collusion. These results extend to auction settings. Our findings underscore the need for antitrust regulation regarding algorithmic pricing, and uncover regulatory challenges unique to LLM-based pricing agents.
- [392] arXiv:2404.00815 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Towards Realistic Scene Generation with LiDAR Diffusion ModelsComments: CVPR 2024. Project link: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Diffusion models (DMs) excel in photo-realistic image synthesis, but their adaptation to LiDAR scene generation poses a substantial hurdle. This is primarily because DMs operating in the point space struggle to preserve the curve-like patterns and 3D geometry of LiDAR scenes, which consumes much of their representation power. In this paper, we propose LiDAR Diffusion Models (LiDMs) to generate LiDAR-realistic scenes from a latent space tailored to capture the realism of LiDAR scenes by incorporating geometric priors into the learning pipeline. Our method targets three major desiderata: pattern realism, geometry realism, and object realism. Specifically, we introduce curve-wise compression to simulate real-world LiDAR patterns, point-wise coordinate supervision to learn scene geometry, and patch-wise encoding for a full 3D object context. With these three core designs, our method achieves competitive performance on unconditional LiDAR generation in 64-beam scenario and state of the art on conditional LiDAR generation, while maintaining high efficiency compared to point-based DMs (up to 107$\times$ faster). Furthermore, by compressing LiDAR scenes into a latent space, we enable the controllability of DMs with various conditions such as semantic maps, camera views, and text prompts.
- [393] arXiv:2404.00816 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: HeteroMILE: a Multi-Level Graph Representation Learning Framework for Heterogeneous GraphsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Heterogeneous graphs are ubiquitous in real-world applications because they can represent various relationships between different types of entities. Therefore, learning embeddings in such graphs is a critical problem in graph machine learning. However, existing solutions for this problem fail to scale to large heterogeneous graphs due to their high computational complexity. To address this issue, we propose a Multi-Level Embedding framework of nodes on a heterogeneous graph (HeteroMILE) - a generic methodology that allows contemporary graph embedding methods to scale to large graphs. HeteroMILE repeatedly coarsens the large sized graph into a smaller size while preserving the backbone structure of the graph before embedding it, effectively reducing the computational cost by avoiding time-consuming processing operations. It then refines the coarsened embedding to the original graph using a heterogeneous graph convolution neural network. We evaluate our approach using several popular heterogeneous graph datasets. The experimental results show that HeteroMILE can substantially reduce computational time (approximately 20x speedup) and generate an embedding of better quality for link prediction and node classification.
- [394] arXiv:2404.00855 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: TSOM: Small Object Motion Detection Neural Network Inspired by Avian Visual CircuitSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Detecting small moving objects in complex backgrounds from an overhead perspective is a highly challenging task for machine vision systems. As an inspiration from nature, the avian visual system is capable of processing motion information in various complex aerial scenes, and its Retina-OT-Rt visual circuit is highly sensitive to capturing the motion information of small objects from high altitudes. However, more needs to be done on small object motion detection algorithms based on the avian visual system. In this paper, we conducted mathematical modeling based on extensive studies of the biological mechanisms of the Retina-OT-Rt visual circuit. Based on this, we proposed a novel tectum small object motion detection neural network (TSOM). The neural network includes the retina, SGC dendritic, SGC Soma, and Rt layers, each layer corresponding to neurons in the visual pathway. The Retina layer is responsible for accurately projecting input content, the SGC dendritic layer perceives and encodes spatial-temporal information, the SGC Soma layer computes complex motion information and extracts small objects, and the Rt layer integrates and decodes motion information from multiple directions to determine the position of small objects. Extensive experiments on pigeon neurophysiological experiments and image sequence data showed that the TSOM is biologically interpretable and effective in extracting reliable small object motion features from complex high-altitude backgrounds.
- [395] arXiv:2404.00856 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Removing Speaker Information from Speech Representation using Variable-Length Soft PoolingSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Recently, there have been efforts to encode the linguistic information of speech using a self-supervised framework for speech synthesis. However, predicting representations from surrounding representations can inadvertently entangle speaker information in the speech representation. This paper aims to remove speaker information by exploiting the structured nature of speech, composed of discrete units like phonemes with clear boundaries. A neural network predicts these boundaries, enabling variable-length pooling for event-based representation extraction instead of fixed-rate methods. The boundary predictor outputs a probability for the boundary between 0 and 1, making pooling soft. The model is trained to minimize the difference with the pooled representation of the data augmented by time-stretch and pitch-shift. To confirm that the learned representation includes contents information but is independent of speaker information, the model was evaluated with libri-light's phonetic ABX task and SUPERB's speaker identification task.
- [396] arXiv:2404.00862 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Bailong: Bilingual Transfer Learning based on QLoRA and Zip-tie EmbeddingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have demonstrated exceptional performance in various NLP applications. However, the majority of existing open-source LLMs are pre-trained primarily on English data and little part of other languages. This deficiency in multilingual training data results in suboptimal performance when applied to languages with fewer available resources. Furthermore, enhancing the performance of LLMs on low-resource languages by full-parameter fine-tuning with additional data requires substantial computational resources, posing computational barriers for research organizations and individual researchers. Consequently, several techniques such as parameter-efficient tuning and advanced embedding initialization have been proposed to address these challenges. In this work, we combine them to facilitate cross-lingual transfer on English-dominated open-source LLM. To effectively enhance the model's proficiency in Traditional Chinese, we conduct secondary pre-training on Llama 2 7B with Traditional Chinese data by leveraging QLoRA and our proposed zip-tie embedding initialization. The resulting model called Bailong, which stands for Bilingual trAnsfer learnIng based on qLOra and zip-tie embeddiNG. We present Bailong-instruct 7B, a fine-tuned version of Bailong 7B optimized for multi-turn dialogue scenarios. Recognizing the inadequacy of benchmark datasets in Traditional Chinese, we further introduce Bailong-bench to assess the alignment of models with human preferences and the capability to follow instructions in both Traditional Chinese and English tasks. In our evaluation, Bailong-instruct 7B exhibits competitive performance on Bailong-bench and other benchmark datasets when compared to other open-source models of similar or even larger parameter sizes. Bailong-instruct 7B and Bailong-bench are publicly available with the aim of empowering the community to build upon our efforts.
- [397] arXiv:2404.00884 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Self-Demos: Eliciting Out-of-Demonstration Generalizability in Large Language ModelsWei He , Shichun Liu , Jun Zhao , Yiwen Ding , Yi Lu , Zhiheng Xi , Tao Gui , Qi Zhang , Xuanjing HuangComments: Accepted to NAACL 2024 FindingsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have shown promising abilities of in-context learning (ICL), adapting swiftly to new tasks with only few-shot demonstrations. However, current few-shot methods heavily depend on high-quality, query-specific demos, which are often lacking. When faced with out-of-demonstration (OOD) queries, methods that rely on hand-crafted demos or external retrievers might fail. To bridge the gap between limited demos and OOD queries, we propose Self-Demos, a novel prompting method that elicits the inherent generalizability in LLMs by query-aware demo generation. The generated demos strategically interpolate between existing demos and the given query, transforming the query from OOD to ID. To evaluate the effectiveness of our approach, we manually constructed OOD-Toolset, a dataset in the tool-using scenario with over 300 real-world APIs and 1000 instances, each consisting of three tool-use cases as demos and an OOD query. Thorough experiments on our dataset and two public math benchmarks have shown that our method can outperform state-of-the-art baselines in the OOD setting. Moreover, we conduct a range of analyses to validate Self-Demos's generalization and provide more insights.
- [398] arXiv:2404.00897 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Machine Learning Robustness: A PrimerSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Abstract: This chapter explores the foundational concept of robustness in Machine Learning (ML) and its integral role in establishing trustworthiness in Artificial Intelligence (AI) systems. The discussion begins with a detailed definition of robustness, portraying it as the ability of ML models to maintain stable performance across varied and unexpected environmental conditions. ML robustness is dissected through several lenses: its complementarity with generalizability; its status as a requirement for trustworthy AI; its adversarial vs non-adversarial aspects; its quantitative metrics; and its indicators such as reproducibility and explainability. The chapter delves into the factors that impede robustness, such as data bias, model complexity, and the pitfalls of underspecified ML pipelines. It surveys key techniques for robustness assessment from a broad perspective, including adversarial attacks, encompassing both digital and physical realms. It covers non-adversarial data shifts and nuances of Deep Learning (DL) software testing methodologies. The discussion progresses to explore amelioration strategies for bolstering robustness, starting with data-centric approaches like debiasing and augmentation. Further examination includes a variety of model-centric methods such as transfer learning, adversarial training, and randomized smoothing. Lastly, post-training methods are discussed, including ensemble techniques, pruning, and model repairs, emerging as cost-effective strategies to make models more resilient against the unpredictable. This chapter underscores the ongoing challenges and limitations in estimating and achieving ML robustness by existing approaches. It offers insights and directions for future research on this crucial concept, as a prerequisite for trustworthy AI systems.
- [399] arXiv:2404.00903 (cross-list from cs.IR) [ pdf , ps , other ]
-
Title: Maximizing User Experience with LLMOps-Driven Personalized Recommendation SystemsSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: The integration of LLMOps into personalized recommendation systems marks a significant advancement in managing LLM-driven applications. This innovation presents both opportunities and challenges for enterprises, requiring specialized teams to navigate the complexity of engineering technology while prioritizing data security and model interpretability. By leveraging LLMOps, enterprises can enhance the efficiency and reliability of large-scale machine learning models, driving personalized recommendations aligned with user preferences. Despite ethical considerations, LLMOps is poised for widespread adoption, promising more efficient and secure machine learning services that elevate user experience and shape the future of personalized recommendation systems.
- [400] arXiv:2404.00913 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LLaMA-Excitor: General Instruction Tuning via Indirect Feature InteractionComments: This paper is accepted by CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Existing methods to fine-tune LLMs, like Adapter, Prefix-tuning, and LoRA, which introduce extra modules or additional input sequences to inject new skills or knowledge, may compromise the innate abilities of LLMs. In this paper, we propose LLaMA-Excitor, a lightweight method that stimulates the LLMs' potential to better follow instructions by gradually paying more attention to worthwhile information. Specifically, the LLaMA-Excitor does not directly change the intermediate hidden state during the self-attention calculation of the transformer structure. We designed the Excitor block as a bypass module for the similarity score computation in LLMs' self-attention to reconstruct keys and change the importance of values by learnable prompts. LLaMA-Excitor ensures a self-adaptive allocation of additional attention to input instructions, thus effectively preserving LLMs' pre-trained knowledge when fine-tuning LLMs on low-quality instruction-following datasets. Furthermore, we unify the modeling of multi-modal tuning and language-only tuning, extending LLaMA-Excitor to a powerful visual instruction follower without the need for complex multi-modal alignment. Our proposed approach is evaluated in language-only and multi-modal tuning experimental scenarios. Notably, LLaMA-Excitor is the only method that maintains basic capabilities while achieving a significant improvement (+6%) on the MMLU benchmark. In the visual instruction tuning, we achieve a new state-of-the-art image captioning performance of 157.5 CIDEr on MSCOCO, and a comparable performance (88.39%) on ScienceQA to cutting-edge models with more parameters and extensive vision-language pertaining.
- [401] arXiv:2404.00914 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Token-Efficient Leverage Learning in Large Language ModelsComments: 15 pages, 16 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs) have excelled in various tasks but perform better in high-resource scenarios, which presents challenges in low-resource scenarios. Data scarcity and the inherent difficulty of adapting LLMs to specific tasks compound the challenge. To address the twin hurdles, we introduce \textbf{Leverage Learning}. We present a streamlined implement of this methodology called Token-Efficient Leverage Learning (TELL). TELL showcases the potential of Leverage Learning, demonstrating effectiveness across various LLMs and low-resource tasks, ranging from $10^4$ to $10^6$ tokens. It reduces task data requirements by up to nearly an order of magnitude compared to conventional Supervised Fine-Tuning (SFT) while delivering competitive performance. With the same amount of task data, TELL leads in improving task performance compared to SFT. We discuss the mechanism of Leverage Learning, suggesting it aligns with quantization hypothesis and explore its promising potential through empirical testing.
- [402] arXiv:2404.00923 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MM3DGS SLAM: Multi-modal 3D Gaussian Splatting for SLAM Using Vision, Depth, and Inertial MeasurementsLisong C. Sun , Neel P. Bhatt , Jonathan C. Liu , Zhiwen Fan , Zhangyang Wang , Todd E. Humphreys , Ufuk TopcuComments: Project Webpage: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Simultaneous localization and mapping is essential for position tracking and scene understanding. 3D Gaussian-based map representations enable photorealistic reconstruction and real-time rendering of scenes using multiple posed cameras. We show for the first time that using 3D Gaussians for map representation with unposed camera images and inertial measurements can enable accurate SLAM. Our method, MM3DGS, addresses the limitations of prior neural radiance field-based representations by enabling faster rendering, scale awareness, and improved trajectory tracking. Our framework enables keyframe-based mapping and tracking utilizing loss functions that incorporate relative pose transformations from pre-integrated inertial measurements, depth estimates, and measures of photometric rendering quality. We also release a multi-modal dataset, UT-MM, collected from a mobile robot equipped with a camera and an inertial measurement unit. Experimental evaluation on several scenes from the dataset shows that MM3DGS achieves 3x improvement in tracking and 5% improvement in photometric rendering quality compared to the current 3DGS SLAM state-of-the-art, while allowing real-time rendering of a high-resolution dense 3D map. Project Webpage: this https URL
- [403] arXiv:2404.00929 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Survey on Multilingual Large Language Models: Corpora, Alignment, and BiasSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Based on the foundation of Large Language Models (LLMs), Multilingual Large Language Models (MLLMs) have been developed to address the challenges of multilingual natural language processing tasks, hoping to achieve knowledge transfer from high-resource to low-resource languages. However, significant limitations and challenges still exist, such as language imbalance, multilingual alignment, and inherent bias. In this paper, we aim to provide a comprehensive analysis of MLLMs, delving deeply into discussions surrounding these critical issues. First of all, we start by presenting an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities. Secondly, we explore widely utilized multilingual corpora for MLLMs' training and multilingual datasets oriented for downstream tasks that are crucial for enhancing the cross-lingual capability of MLLMs. Thirdly, we survey the existing studies on multilingual representations and investigate whether the current MLLMs can learn a universal language representation. Fourthly, we discuss bias on MLLMs including its category and evaluation metrics, and summarize the existing debiasing techniques. Finally, we discuss existing challenges and point out promising research directions. By demonstrating these aspects, this paper aims to facilitate a deeper understanding of MLLMs and their potentiality in various domains.
- [404] arXiv:2404.00942 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Evaluating the Factuality of Large Language Models using Large-Scale Knowledge GraphsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The advent of Large Language Models (LLMs) has significantly transformed the AI landscape, enhancing machine learning and AI capabilities. Factuality issue is a critical concern for LLMs, as they may generate factually incorrect responses. In this paper, we propose GraphEval to evaluate an LLM's performance using a substantially large test dataset. Specifically, the test dataset is retrieved from a large knowledge graph with more than 10 million facts without expensive human efforts. Unlike conventional methods that evaluate LLMs based on generated responses, GraphEval streamlines the evaluation process by creating a judge model to estimate the correctness of the answers given by the LLM. Our experiments demonstrate that the judge model's factuality assessment aligns closely with the correctness of the LLM's generated outputs, while also substantially reducing evaluation costs. Besides, our findings offer valuable insights into LLM performance across different metrics and highlight the potential for future improvements in ensuring the factual integrity of LLM outputs. The code is publicly available at this https URL .
- [405] arXiv:2404.00943 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Evalverse: Unified and Accessible Library for Large Language Model EvaluationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces Evalverse, a novel library that streamlines the evaluation of Large Language Models (LLMs) by unifying disparate evaluation tools into a single, user-friendly framework. Evalverse enables individuals with limited knowledge of artificial intelligence to easily request LLM evaluations and receive detailed reports, facilitated by an integration with communication platforms like Slack. Thus, Evalverse serves as a powerful tool for the comprehensive assessment of LLMs, offering both researchers and practitioners a centralized and easily accessible evaluation framework. Finally, we also provide a demo video for Evalverse, showcasing its capabilities and implementation in a two-minute format.
- [406] arXiv:2404.00971 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Exploring and Evaluating Hallucinations in LLM-Powered Code GenerationSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: The rise of Large Language Models (LLMs) has significantly advanced many applications on software engineering tasks, particularly in code generation. Despite the promising performance, LLMs are prone to generate hallucinations, which means LLMs might produce outputs that deviate from users' intent, exhibit internal inconsistencies, or misalign with the factual knowledge, making the deployment of LLMs potentially risky in a wide range of applications. Existing work mainly focuses on investing the hallucination in the domain of natural language generation (NLG), leaving a gap in understanding the types and extent of hallucinations in the context of code generation. To bridge the gap, we conducted a thematic analysis of the LLM-generated code to summarize and categorize the hallucinations present in it. Our study established a comprehensive taxonomy of hallucinations in LLM-generated code, encompassing 5 primary categories of hallucinations depending on the conflicting objectives and varying degrees of deviation observed in code generation. Furthermore, we systematically analyzed the distribution of hallucinations, exploring variations among different LLMs and their correlation with code correctness. Based on the results, we proposed HalluCode, a benchmark for evaluating the performance of code LLMs in recognizing hallucinations. Hallucination recognition and mitigation experiments with HalluCode and HumanEval show existing LLMs face great challenges in recognizing hallucinations, particularly in identifying their types, and are hardly able to mitigate hallucinations. We believe our findings will shed light on future research about hallucination evaluation, detection, and mitigation, ultimately paving the way for building more effective and reliable code LLMs in the future.
- [407] arXiv:2404.00977 (cross-list from nlin.AO) [ pdf , ps , html , other ]
-
Title: Nonlinear Impulse Pattern Formulation dynamical social and political prediction algorithm for city planning and public participationSubjects: Adaptation and Self-Organizing Systems (nlin.AO) ; Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
Abstract: A nonlinear-dynamical algorithm for city planning is proposed as an Impulse Pattern Formulation (IPF) for predicting relevant parameters like health, artistic freedom, or financial developments of different social or political stakeholders over the cause of a planning process. The IPF has already shown high predictive precision at low computational cost in musical instrument simulations, brain dynamics, and human-human interactions. The social and political IPF consists of three basic equations of system state developments, self-adaptation of stakeholders, two adaptive interactions, and external impact terms suitable for respective planning situations. Typical scenarios of stakeholder interactions and developments are modeled by adjusting a set of system parameters. These include stakeholder reaction to external input, enhanced system stability through self-adaptation, stakeholder convergence due to mediative interaction adaptation, as well as complex dynamics in terms of direct stakeholder impacts. A workflow for implementing the algorithm in real city planning scenarios is outlined. This workflow includes machine learning of a suitable set of parameters suggesting best-practice planning to aim at the desired development of the planning process and its output.
- [408] arXiv:2404.00983 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Continual Learning for Smart City: A SurveyComments: Preprint. Work in ProgressSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: With the digitization of modern cities, large data volumes and powerful computational resources facilitate the rapid update of intelligent models deployed in smart cities. Continual learning (CL) is a novel machine learning paradigm that constantly updates models to adapt to changing environments, where the learning tasks, data, and distributions can vary over time. Our survey provides a comprehensive review of continual learning methods that are widely used in smart city development. The content consists of three parts: 1) Methodology-wise. We categorize a large number of basic CL methods and advanced CL frameworks in combination with other learning paradigms including graph learning, spatial-temporal learning, multi-modal learning, and federated learning. 2) Application-wise. We present numerous CL applications covering transportation, environment, public health, safety, networks, and associated datasets related to urban computing. 3) Challenges. We discuss current problems and challenges and envision several promising research directions. We believe this survey can help relevant researchers quickly familiarize themselves with the current state of continual learning research used in smart city development and direct them to future research trends.
- [409] arXiv:2404.00989 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: 360+x: A Panoptic Multi-modal Scene Understanding DatasetComments: CVPR 2024 (Oral Presentation), Project page: this https URLJournal-ref: The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: Human perception of the world is shaped by a multitude of viewpoints and modalities. While many existing datasets focus on scene understanding from a certain perspective (e.g. egocentric or third-person views), our dataset offers a panoptic perspective (i.e. multiple viewpoints with multiple data modalities). Specifically, we encapsulate third-person panoramic and front views, as well as egocentric monocular/binocular views with rich modalities including video, multi-channel audio, directional binaural delay, location data and textual scene descriptions within each scene captured, presenting comprehensive observation of the world. Figure 1 offers a glimpse of all 28 scene categories of our 360+x dataset. To the best of our knowledge, this is the first database that covers multiple viewpoints with multiple data modalities to mimic how daily information is accessed in the real world. Through our benchmark analysis, we presented 5 different scene understanding tasks on the proposed 360+x dataset to evaluate the impact and benefit of each data modality and perspective in panoptic scene understanding. We hope this unique dataset could broaden the scope of comprehensive scene understanding and encourage the community to approach these problems from more diverse perspectives.
- [410] arXiv:2404.00998 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report GenerationComments: 11 pages, 6 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Evaluating generated radiology reports is crucial for the development of radiology AI, but existing metrics fail to reflect the task's clinical requirements. This study proposes a novel evaluation framework using large language models (LLMs) to compare radiology reports for assessment. We compare the performance of various LLMs and demonstrate that, when using GPT-4, our proposed metric achieves evaluation consistency close to that of radiologists. Furthermore, to reduce costs and improve accessibility, making this method practical, we construct a dataset using LLM evaluation results and perform knowledge distillation to train a smaller model. The distilled model achieves evaluation capabilities comparable to GPT-4. Our framework and distilled model offer an accessible and efficient evaluation method for radiology report generation, facilitating the development of more clinically relevant models. The model will be further open-sourced and accessible.
- [411] arXiv:2404.01012 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Query Performance Prediction using Relevance Judgments Generated by Large Language ModelsSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value and do not require the predicted values to approximate a specific information retrieval (IR) evaluation measure, leading to certain drawbacks: (i) a single scalar is insufficient to accurately represent different IR evaluation measures, especially when metrics do not highly correlate, and (ii) a single scalar limits the interpretability of QPP methods because solely using a scalar is insufficient to explain QPP results. To address these issues, we propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of judging the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels; Also, this allows us to interpret predicted IR evaluation measures, and identify, track and rectify errors in generated relevance judgments to improve QPP quality. We judge relevance by leveraging a leading open-source large language model (LLM), LLaMA, to ensure scientific reproducibility. In doing so, we address two main challenges: (i) excessive computational costs of judging the entire corpus for predicting a recall-based metric, and (ii) poor performance in prompting LLaMA in a zero-/few-shot manner. We devise an approximation strategy to predict a recall-oriented IR measure and propose to fine-tune LLaMA using human-labeled relevance judgments. Experiments on the TREC 2019-2022 deep learning tracks show that QPP-GenRE achieves state-of-the-art QPP accuracy for both lexical and neural rankers in both precision- and recall-oriented metrics.
- [412] arXiv:2404.01013 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Teeth-SEG: An Efficient Instance Segmentation Framework for Orthodontic Treatment based on Anthropic Prior KnowledgeBo Zou , Shaofeng Wang , Hao Liu , Gaoyue Sun , Yajie Wang , FeiFei Zuo , Chengbin Quan , Youjian ZhaoComments: This paper has been accepted by CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Teeth localization, segmentation, and labeling in 2D images have great potential in modern dentistry to enhance dental diagnostics, treatment planning, and population-based studies on oral health. However, general instance segmentation frameworks are incompetent due to 1) the subtle differences between some teeth' shapes (e.g., maxillary first premolar and second premolar), 2) the teeth's position and shape variation across subjects, and 3) the presence of abnormalities in the dentition (e.g., caries and edentulism). To address these problems, we propose a ViT-based framework named TeethSEG, which consists of stacked Multi-Scale Aggregation (MSA) blocks and an Anthropic Prior Knowledge (APK) layer. Specifically, to compose the two modules, we design 1) a unique permutation-based upscaler to ensure high efficiency while establishing clear segmentation boundaries with 2) multi-head self/cross-gating layers to emphasize particular semantics meanwhile maintaining the divergence between token embeddings. Besides, we collect 3) the first open-sourced intraoral image dataset IO150K, which comprises over 150k intraoral photos, and all photos are annotated by orthodontists using a human-machine hybrid algorithm. Experiments on IO150K demonstrate that our TeethSEG outperforms the state-of-the-art segmentation models on dental image segmentation.
- [413] arXiv:2404.01019 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Source-Aware Training Enables Knowledge Attribution in Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) learn a vast amount of knowledge during pretraining, but they are often oblivious to the source(s) of such knowledge. We investigate the problem of intrinsic source citation, where LLMs are required to cite the pretraining source supporting a generated response. Intrinsic source citation can enhance LLM transparency, interpretability, and verifiability. To give LLMs such ability, we explore source-aware training -- a post pretraining recipe that involves (i) training the LLM to associate unique source document identifiers with the knowledge in each document, followed by (ii) an instruction-tuning to teach the LLM to cite a supporting pretraining source when prompted. Source-aware training can easily be applied to pretrained LLMs off the shelf, and diverges minimally from existing pretraining/fine-tuning frameworks. Through experiments on carefully curated data, we demonstrate that our training recipe can enable faithful attribution to the pretraining data without a substantial impact on the model's quality compared to standard pretraining. Our results also highlight the importance of data augmentation in achieving attribution. Code and data available here: \url{ this https URL }
- [414] arXiv:2404.01030 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Survey of Bias In Text-to-Image Generation: Definition, Evaluation, and MitigationYixin Wan , Arjun Subramonian , Anaelia Ovalle , Zongyu Lin , Ashima Suvarna , Christina Chance , Hritik Bansal , Rebecca Pattichis , Kai-Wei ChangSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: The recent advancement of large and powerful models with Text-to-Image (T2I) generation abilities -- such as OpenAI's DALLE-3 and Google's Gemini -- enables users to generate high-quality images from textual prompts. However, it has become increasingly evident that even simple prompts could cause T2I models to exhibit conspicuous social bias in generated images. Such bias might lead to both allocational and representational harms in society, further marginalizing minority groups. Noting this problem, a large body of recent works has been dedicated to investigating different dimensions of bias in T2I systems. However, an extensive review of these studies is lacking, hindering a systematic understanding of current progress and research gaps. We present the first extensive survey on bias in T2I generative models. In this survey, we review prior studies on dimensions of bias: Gender, Skintone, and Geo-Culture. Specifically, we discuss how these works define, evaluate, and mitigate different aspects of bias. We found that: (1) while gender and skintone biases are widely studied, geo-cultural bias remains under-explored; (2) most works on gender and skintone bias investigated occupational association, while other aspects are less frequently studied; (3) almost all gender bias works overlook non-binary identities in their studies; (4) evaluation datasets and metrics are scattered, with no unified framework for measuring biases; and (5) current mitigation methods fail to resolve biases comprehensively. Based on current limitations, we point out future research directions that contribute to human-centric definitions, evaluations, and mitigation of biases. We hope to highlight the importance of studying biases in T2I systems, as well as encourage future efforts to holistically understand and tackle biases, building fair and trustworthy T2I technologies for everyone.
- [415] arXiv:2404.01036 (cross-list from cs.IR) [ pdf , ps , other ]
-
Title: Higher education assessment practice in the era of generative AI toolsComments: 11 pages, 7 tables published in the Journal of Applied Learning & TeachingJournal-ref: Higher education assessment practice in the era of generative AI tools. (2024). Journal of applied learning and teaching, 7(1)Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: The higher education (HE) sector benefits every nation's economy and society at large. However, their contributions are challenged by advanced technologies like generative artificial intelligence (GenAI) tools. In this paper, we provide a comprehensive assessment of GenAI tools towards assessment and pedagogic practice and, subsequently, discuss the potential impacts. This study experimented using three assessment instruments from data science, data analytics, and construction management disciplines. Our findings are two-fold: first, the findings revealed that GenAI tools exhibit subject knowledge, problem-solving, analytical, critical thinking, and presentation skills and thus can limit learning when used unethically. Secondly, the design of the assessment of certain disciplines revealed the limitations of the GenAI tools. Based on our findings, we made recommendations on how AI tools can be utilised for teaching and learning in HE.
- [416] arXiv:2404.01041 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Can LLMs get help from other LLMs without revealing private information?Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Abstract: Cascades are a common type of machine learning systems in which a large, remote model can be queried if a local model is not able to accurately label a user's data by itself. Serving stacks for large language models (LLMs) increasingly use cascades due to their ability to preserve task performance while dramatically reducing inference costs. However, applying cascade systems in situations where the local model has access to sensitive data constitutes a significant privacy risk for users since such data could be forwarded to the remote model. In this work, we show the feasibility of applying cascade systems in such setups by equipping the local model with privacy-preserving techniques that reduce the risk of leaking private information when querying the remote model. To quantify information leakage in such setups, we introduce two privacy measures. We then propose a system that leverages the recently introduced social learning paradigm in which LLMs collaboratively learn from each other by exchanging natural language. Using this paradigm, we demonstrate on several datasets that our methods minimize the privacy loss while at the same time improving task performance compared to a non-cascade baseline.
- [417] arXiv:2404.01054 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model AlignmentSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding. BoN sampling is susceptible to a problem known as reward hacking. Because the reward model is an imperfect proxy for the true objective, over-optimizing its value can compromise its performance on the true objective. A common solution to prevent reward hacking in preference learning techniques is to optimize a reward using proximity regularization (e.g., KL regularization), which ensures that the language model remains close to the reference model. In this research, we propose Regularized Best-of-N (RBoN), a variant of BoN that aims to mitigate reward hacking by incorporating a proximity term in response selection, similar to preference learning techniques. We evaluate two variants of RBoN on the AlpacaFarm dataset and find that they outperform BoN, especially when the proxy reward model has a low correlation with the true objective.
- [418] arXiv:2404.01070 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Advancing AI with Integrity: Ethical Challenges and Solutions in Neural Machine TranslationComments: 11 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper addresses the ethical challenges of Artificial Intelligence in Neural Machine Translation (NMT) systems, emphasizing the imperative for developers to ensure fairness and cultural sensitivity. We investigate the ethical competence of AI models in NMT, examining the Ethical considerations at each stage of NMT development, including data handling, privacy, data ownership, and consent. We identify and address ethical issues through empirical studies. These include employing Transformer models for Luganda-English translations and enhancing efficiency with sentence mini-batching. And complementary studies that refine data labeling techniques and fine-tune BERT and Longformer models for analyzing Luganda and English social media content. Our second approach is a literature review from databases such as Google Scholar and platforms like GitHub. Additionally, the paper probes the distribution of responsibility between AI systems and humans, underscoring the essential role of human oversight in upholding NMT ethical standards. Incorporating a biblical perspective, we discuss the societal impact of NMT and the broader ethical responsibilities of developers, positing them as stewards accountable for the societal repercussions of their creations.
- [419] arXiv:2404.01084 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: AILS-NTUA at SemEval-2024 Task 9: Cracking Brain Teasers: Transformer Models for Lateral Thinking PuzzlesComments: SemEval-2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we outline our submission for the SemEval-2024 Task 9 competition: 'BRAINTEASER: A Novel Task Defying Common Sense'. We engage in both sub-tasks: Sub-task A-Sentence Puzzle and Sub-task B-Word Puzzle. We evaluate a plethora of pre-trained transformer-based language models of different sizes through fine-tuning. Subsequently, we undertake an analysis of their scores and responses to aid future researchers in understanding and utilizing these models effectively. Our top-performing approaches secured competitive positions on the competition leaderboard across both sub-tasks. In the evaluation phase, our best submission attained an average accuracy score of 81.7% in the Sentence Puzzle, and 85.4% in the Word Puzzle, significantly outperforming the best neural baseline (ChatGPT) by more than 20% and 30% respectively.
- [420] arXiv:2404.01089 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-OnComments: CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Image-based virtual try-on is an increasingly important task for online shopping. It aims to synthesize images of a specific person wearing a specified garment. Diffusion model-based approaches have recently become popular, as they are excellent at image synthesis tasks. However, these approaches usually employ additional image encoders and rely on the cross-attention mechanism for texture transfer from the garment to the person image, which affects the try-on's efficiency and fidelity. To address these issues, we propose an Texture-Preserving Diffusion (TPD) model for virtual try-on, which enhances the fidelity of the results and introduces no additional image encoders. Accordingly, we make contributions from two aspects. First, we propose to concatenate the masked person and reference garment images along the spatial dimension and utilize the resulting image as the input for the diffusion model's denoising UNet. This enables the original self-attention layers contained in the diffusion model to achieve efficient and accurate texture transfer. Second, we propose a novel diffusion-based method that predicts a precise inpainting mask based on the person and reference garment images, further enhancing the reliability of the try-on results. In addition, we integrate mask prediction and image synthesis into a single compact model. The experimental results show that our approach can be applied to various try-on tasks, e.g., garment-to-person and person-to-person try-ons, and significantly outperforms state-of-the-art methods on popular VITON, VITON-HD databases.
- [421] arXiv:2404.01099 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: What's in Your "Safe" Data?: Identifying Benign Data that Breaks SafetySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Abstract: Current Large Language Models (LLMs), even those tuned for safety and alignment, are susceptible to jailbreaking. Some have found that just further fine-tuning an aligned model with benign data (i.e., data without harmful content) surprisingly leads to substantial degradation in safety. We delve into the data-centric aspects of why benign fine-tuning inadvertently contributes to jailbreaking. First, we represent fine-tuning data through two lenses: representation and gradient spaces. Furthermore, we propose a bi-directional anchoring method that prioritizes data points that are close to harmful examples and distant from benign ones. By doing so, our approach effectively identifies subsets of benign data that are more likely to degrade the model's safety after fine-tuning. Training on just 100 of these seemingly benign datapoints can lead to the fine-tuned model affirmatively responding to > 70% of tested harmful requests, compared to < 20% after fine-tuning on randomly selected data. We further find that selected data are often in the form of lists and bullet points, or math questions.
- [422] arXiv:2404.01109 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: An incremental hybrid adaptive network-based IDS in Software Defined Networks to detect stealth attacksSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Network attacks have became increasingly more sophisticated and stealthy due to the advances in technologies and the growing sophistication of attackers. Advanced Persistent Threats (APTs) are a type of attack that implement a wide range of strategies to evade detection and be under the defence radar. Software Defined Network (SDN) is a network paradigm that implements dynamic configuration by separating the control plane from the network plane. This approach improves security aspects by facilitating the employment of network intrusion detection systems. Implementing Machine Learning (ML) techniques in Intrusion Detection Systems (IDSs) is widely used to detect such attacks but has a challenge when the data distribution changes. Concept drift is a term that describes the change in the relationship between the input data and the target value (label or class). The model is expected to degrade as certain forms of change occur. In this paper, the primary form of change will be in user behaviour (particularly changes in attacker behaviour). It is essential for a model to adapt itself to deviations in data distribution. SDN can help in monitoring changes in data distribution. This paper discusses changes in stealth attacker behaviour. The work described here investigates various concept drift detection algorithms. An incremental hybrid adaptive Network Intrusion Detection System (NIDS) is proposed to tackle the issue of concept drift in SDN. It can detect known and unknown attacks. The model is evaluated over different datasets showing promising results.
- [423] arXiv:2404.01127 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Medical Visual Prompting (MVP): A Unified Framework for Versatile and High-Quality Medical Image SegmentationYulin Chen , Guoheng Huang , Kai Huang , Zijin Lin , Guo Zhong , Shenghong Luo , Jie Deng , Jian ZhouSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Accurate segmentation of lesion regions is crucial for clinical diagnosis and treatment across various diseases. While deep convolutional networks have achieved satisfactory results in medical image segmentation, they face challenges such as loss of lesion shape information due to continuous convolution and downsampling, as well as the high cost of manually labeling lesions with varying shapes and sizes. To address these issues, we propose a novel medical visual prompting (MVP) framework that leverages pre-training and prompting concepts from natural language processing (NLP). The framework utilizes three key components: Super-Pixel Guided Prompting (SPGP) for superpixelating the input image, Image Embedding Guided Prompting (IEGP) for freezing patch embedding and merging with superpixels to provide visual prompts, and Adaptive Attention Mechanism Guided Prompting (AAGP) for pinpointing prompt content and efficiently adapting all layers. By integrating SPGP, IEGP, and AAGP, the MVP enables the segmentation network to better learn shape prompting information and facilitates mutual learning across different tasks. Extensive experiments conducted on five datasets demonstrate superior performance of this method in various challenging medical image tasks, while simplifying single-task medical segmentation models. This novel framework offers improved performance with fewer parameters and holds significant potential for accurate segmentation of lesion regions in various medical tasks, making it clinically valuable.
- [424] arXiv:2404.01131 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: GOV-REK: Governed Reward Engineering Kernels for Designing Robust Multi-Agent Reinforcement Learning SystemsComments: Extended Abstract accepted in the 23rd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2024)Subjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI)
Abstract: For multi-agent reinforcement learning systems (MARLS), the problem formulation generally involves investing massive reward engineering effort specific to a given problem. However, this effort often cannot be translated to other problems; worse, it gets wasted when system dynamics change drastically. This problem is further exacerbated in sparse reward scenarios, where a meaningful heuristic can assist in the policy convergence task. We propose GOVerned Reward Engineering Kernels (GOV-REK), which dynamically assign reward distributions to agents in MARLS during its learning stage. We also introduce governance kernels, which exploit the underlying structure in either state or joint action space for assigning meaningful agent reward distributions. During the agent learning stage, it iteratively explores different reward distribution configurations with a Hyperband-like algorithm to learn ideal agent reward models in a problem-agnostic manner. Our experiments demonstrate that our meaningful reward priors robustly jumpstart the learning process for effectively learning different MARL problems.
- [425] arXiv:2404.01135 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Enhancing Reasoning Capacity of SLM using Cognitive EnhancementSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have been applied to automate cyber security activities and processes including cyber investigation and digital forensics. However, the use of such models for cyber investigation and digital forensics should address accountability and security considerations. Accountability ensures models have the means to provide explainable reasonings and outcomes. This information can be extracted through explicit prompt requests. For security considerations, it is crucial to address privacy and confidentiality of the involved data during data processing as well. One approach to deal with this consideration is to have the data processed locally using a local instance of the model. Due to limitations of locally available resources, namely memory and GPU capacities, a Smaller Large Language Model (SLM) will typically be used. These SLMs have significantly fewer parameters compared to the LLMs. However, such size reductions have notable performance reduction, especially when tasked to provide reasoning explanations. In this paper, we aim to mitigate performance reduction through the integration of cognitive strategies that humans use for problem-solving. We term this as cognitive enhancement through prompts. Our experiments showed significant improvement gains of the SLMs' performances when such enhancements were applied. We believe that our exploration study paves the way for further investigation into the use of cognitive enhancement to optimize SLM for cyber security applications.
- [426] arXiv:2404.01143 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Condition-Aware Neural Network for Controlled Image GenerationComments: CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We present Condition-Aware Neural Network (CAN), a new method for adding control to image generative models. In parallel to prior conditional control methods, CAN controls the image generation process by dynamically manipulating the weight of the neural network. This is achieved by introducing a condition-aware weight generation module that generates conditional weight for convolution/linear layers based on the input condition. We test CAN on class-conditional image generation on ImageNet and text-to-image generation on COCO. CAN consistently delivers significant improvements for diffusion transformer models, including DiT and UViT. In particular, CAN combined with EfficientViT (CaT) achieves 2.78 FID on ImageNet 512x512, surpassing DiT-XL/2 while requiring 52x fewer MACs per sampling step.
- [427] arXiv:2404.01154 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Uncovering the Text Embedding in Text-to-Image Diffusion ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The correspondence between input text and the generated image exhibits opacity, wherein minor textual modifications can induce substantial deviations in the generated image. While, text embedding, as the pivotal intermediary between text and images, remains relatively underexplored. In this paper, we address this research gap by delving into the text embedding space, unleashing its capacity for controllable image editing and explicable semantic direction attributes within a learning-free framework. Specifically, we identify two critical insights regarding the importance of per-word embedding and their contextual correlations within text embedding, providing instructive principles for learning-free image editing. Additionally, we find that text embedding inherently possesses diverse semantic potentials, and further reveal this property through the lens of singular value decomposition (SVD). These uncovered properties offer practical utility for image editing and semantic discovery. More importantly, we expect the in-depth analyses and findings of the text embedding can enhance the understanding of text-to-image diffusion models.
- [428] arXiv:2404.01156 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language PretrainingComments: CVPR2024 AcceptedSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Vision-language models (VLMs) have made significant strides in cross-modal understanding through large-scale paired datasets. However, in fashion domain, datasets often exhibit a disparity between the information conveyed in image and text. This issue stems from datasets containing multiple images of a single fashion item all paired with one text, leading to cases where some textual details are not visible in individual images. This mismatch, particularly when non-co-occurring elements are masked, undermines the training of conventional VLM objectives like Masked Language Modeling and Masked Image Modeling, thereby hindering the model's ability to accurately align fine-grained visual and textual features. Addressing this problem, we propose Synchronized attentional Masking (SyncMask), which generate masks that pinpoint the image patches and word tokens where the information co-occur in both image and text. This synchronization is accomplished by harnessing cross-attentional features obtained from a momentum model, ensuring a precise alignment between the two modalities. Additionally, we enhance grouped batch sampling with semi-hard negatives, effectively mitigating false negative issues in Image-Text Matching and Image-Text Contrastive learning objectives within fashion datasets. Our experiments demonstrate the effectiveness of the proposed approach, outperforming existing methods in three downstream tasks.
- [429] arXiv:2404.01163 (cross-list from math.NA) [ pdf , ps , html , other ]
-
Title: Capturing Shock Waves by Relaxation Neural NetworksSubjects: Numerical Analysis (math.NA) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we put forward a neural network framework to solve the nonlinear hyperbolic systems. This framework, named relaxation neural networks(RelaxNN), is a simple and scalable extension of physics-informed neural networks(PINN). It is shown later that a typical PINN framework struggles to handle shock waves that arise in hyperbolic systems' solutions. This ultimately results in the failure of optimization that is based on gradient descent in the training process. Relaxation systems provide a smooth asymptotic to the discontinuity solution, under the expectation that macroscopic problems can be solved from a microscopic perspective. Based on relaxation systems, the RelaxNN framework alleviates the conflict of losses in the training process of the PINN framework. In addition to the remarkable results demonstrated in numerical simulations, most of the acceleration techniques and improvement strategies aimed at the standard PINN framework can also be applied to the RelaxNN framework.
- [430] arXiv:2404.01217 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Incorporating Domain Differential Equations into Graph Convolutional Networks to Lower Generalization DiscrepancySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Ensuring both accuracy and robustness in time series prediction is critical to many applications, ranging from urban planning to pandemic management. With sufficient training data where all spatiotemporal patterns are well-represented, existing deep-learning models can make reasonably accurate predictions. However, existing methods fail when the training data are drawn from different circumstances (e.g., traffic patterns on regular days) compared to test data (e.g., traffic patterns after a natural disaster). Such challenges are usually classified under domain generalization. In this work, we show that one way to address this challenge in the context of spatiotemporal prediction is by incorporating domain differential equations into Graph Convolutional Networks (GCNs). We theoretically derive conditions where GCNs incorporating such domain differential equations are robust to mismatched training and testing data compared to baseline domain agnostic models. To support our theory, we propose two domain-differential-equation-informed networks called Reaction-Diffusion Graph Convolutional Network (RDGCN), which incorporates differential equations for traffic speed evolution, and Susceptible-Infectious-Recovered Graph Convolutional Network (SIRGCN), which incorporates a disease propagation model. Both RDGCN and SIRGCN are based on reliable and interpretable domain differential equations that allow the models to generalize to unseen patterns. We experimentally show that RDGCN and SIRGCN are more robust with mismatched testing data than the state-of-the-art deep learning methods.
- [431] arXiv:2404.01223 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Feature Splatting: Language-Driven Physics-Based Scene Synthesis and EditingComments: Project website: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Abstract: Scene representations using 3D Gaussian primitives have produced excellent results in modeling the appearance of static and dynamic 3D scenes. Many graphics applications, however, demand the ability to manipulate both the appearance and the physical properties of objects. We introduce Feature Splatting, an approach that unifies physics-based dynamic scene synthesis with rich semantics from vision language foundation models that are grounded by natural language. Our first contribution is a way to distill high-quality, object-centric vision-language features into 3D Gaussians, that enables semi-automatic scene decomposition using text queries. Our second contribution is a way to synthesize physics-based dynamics from an otherwise static scene using a particle-based simulator, in which material properties are assigned automatically via text queries. We ablate key techniques used in this pipeline, to illustrate the challenge and opportunities in using feature-carrying 3D Gaussians as a unified format for appearance, geometry, material properties and semantics grounded on natural language. Project website: this https URL
- [432] arXiv:2404.01258 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Direct Preference Optimization of Video Large Multimodal Models from Language Model RewardRuohong Zhang , Liangke Gui , Zhiqing Sun , Yihao Feng , Keyang Xu , Yuanhan Zhang , Di Fu , Chunyuan Li , Alexander Hauptmann , Yonatan Bisk , Yiming YangSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Preference modeling techniques, such as direct preference optimization (DPO), has shown effective in enhancing the generalization abilities of large language model (LLM). However, in tasks involving video instruction-following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous studies have explored using large large multimodal models (LMMs) as reward models to guide preference modeling, but their ability to accurately assess the factuality of generated responses compared to corresponding videos has not been conclusively established. This paper introduces a novel framework that utilizes detailed video captions as a proxy of video content, enabling language models to incorporate this information as supporting evidence for scoring video Question Answering (QA) predictions. Our approach demonstrates robust alignment with OpenAI GPT-4V model's reward mechanism, which directly takes video frames as input. Furthermore, we show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video QA tasks.
- [433] arXiv:2404.01260 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Bridging Remote Sensors with Multisensor Geospatial Foundation ModelsComments: Accepted to CVPRSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In the realm of geospatial analysis, the diversity of remote sensors, encompassing both optical and microwave technologies, offers a wealth of distinct observational capabilities. Recognizing this, we present msGFM, a multisensor geospatial foundation model that effectively unifies data from four key sensor modalities. This integration spans an expansive dataset of two million multisensor images. msGFM is uniquely adept at handling both paired and unpaired sensor data. For data originating from identical geolocations, our model employs an innovative cross-sensor pretraining approach in masked image modeling, enabling the synthesis of joint representations from diverse sensors. msGFM, incorporating four remote sensors, upholds strong performance, forming a comprehensive model adaptable to various sensor types. msGFM has demonstrated enhanced proficiency in a range of both single-sensor and multisensor downstream tasks. These include scene classification, segmentation, cloud removal, and pan-sharpening. A key discovery of our research is that representations derived from natural images are not always compatible with the distinct characteristics of geospatial remote sensors, underscoring the limitations of existing representations in this field. Our work can serve as a guide for developing multisensor geospatial pretraining models, paving the way for more advanced geospatial capabilities.
- [434] arXiv:2404.01261 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: FABLES: Evaluating faithfulness and content selection in book-length summarizationYekyung Kim , Yapei Chang , Marzena Karpinska , Aparna Garimella , Varun Manjunatha , Kyle Lo , Tanya Goyal , Mohit IyyerComments: preprint - 39 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: While long-context large language models (LLMs) can technically summarize book-length documents (>100K tokens), the length and complexity of the documents have so far prohibited evaluations of input-dependent aspects like faithfulness. In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books. Our study mitigates the issue of data contamination by focusing on summaries of books published in 2023 or 2024, and we hire annotators who have fully read each book prior to the annotation task to minimize cost and cognitive burden. We collect FABLES, a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD, which allows us to rank LLM summarizers based on faithfulness: Claude-3-Opus significantly outperforms all closed-source LLMs, while the open-source Mixtral is on par with GPT-3.5-Turbo. An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate. While LLM-based auto-raters have proven reliable for factuality and coherence in other settings, we implement several LLM raters of faithfulness and find that none correlates strongly with human annotations, especially with regard to detecting unfaithful claims. Our experiments suggest that detecting unfaithful claims is an important future direction not only for summarization evaluation but also as a testbed for long-context understanding. Finally, we move beyond faithfulness by exploring content selection errors in book-length summarization: we develop a typology of omission errors related to crucial narrative elements and also identify a systematic over-emphasis on events occurring towards the end of the book.
- [435] arXiv:2404.01268 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Mapping the Increasing Use of LLMs in Scientific PapersWeixin Liang , Yaohui Zhang , Zhengxuan Wu , Haley Lepp , Wenlong Ji , Xuandong Zhao , Hancheng Cao , Sheng Liu , Siyu He , Zhi Huang , Diyi Yang , Christopher Potts , Christopher D Manning , James Y. ZouSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract: Scientific publishing lays the foundation of science by disseminating research findings, fostering collaboration, encouraging reproducibility, and ensuring that scientific knowledge is accessible, verifiable, and built upon over time. Recently, there has been immense speculation about how many people are using large language models (LLMs) like ChatGPT in their academic writing, and to what extent this tool might have an effect on global scientific practices. However, we lack a precise measure of the proportion of academic writing substantially modified or produced by LLMs. To address this gap, we conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals, using a population-level statistical framework to measure the prevalence of LLM-modified content over time. Our statistical estimation operates on the corpus level and is more robust than inference on individual instances. Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers (up to 17.5%). In comparison, Mathematics papers and the Nature portfolio showed the least LLM modification (up to 6.3%). Moreover, at an aggregate level, our analysis reveals that higher levels of LLM-modification are associated with papers whose first authors post preprints more frequently, papers in more crowded research areas, and papers of shorter lengths. Our findings suggests that LLMs are being broadly used in scientific writings.
- [436] arXiv:2404.01291 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Evaluating Text-to-Visual Generation with Image-to-Text GenerationZhiqiu Lin , Deepak Pathak , Baiqi Li , Jiayao Li , Xide Xia , Graham Neubig , Pengchuan Zhang , Deva RamananComments: We open-source our data, model, and code at: this https URL ; Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Abstract: Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and high-order reasoning like comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
- [437] arXiv:2404.01295 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Safety and Helpfulness Balanced Responses via Controllable Large Language ModelsYi-Lin Tuan , Xilun Chen , Eric Michael Smith , Louis Martin , Soumya Batra , Asli Celikyilmaz , William Yang Wang , Daniel M. BikelSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: As large language models (LLMs) become easily accessible nowadays, the trade-off between safety and helpfulness can significantly impact user experience. A model that prioritizes safety will cause users to feel less engaged and assisted while prioritizing helpfulness will potentially cause harm. Possible harms include teaching people how to build a bomb, exposing youth to inappropriate content, and hurting users' mental health. In this work, we propose to balance safety and helpfulness in diverse use cases by controlling both attributes in LLM. We explore training-free and fine-tuning methods that do not require extra human annotations and analyze the challenges of controlling safety and helpfulness in LLMs. Our experiments demonstrate that our method can rewind a learned model and unlock its controllability.
- [438] arXiv:2404.01299 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual ScenesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Causal video question answering (QA) has garnered increasing interest, yet existing datasets often lack depth in causal reasoning analysis. To address this gap, we capitalize on the unique properties of cartoons and construct CausalChaos!, a novel, challenging causal Why-QA dataset built upon the iconic "Tom and Jerry" cartoon series. With thoughtful questions and multi-level answers, our dataset contains much longer causal chains embedded in dynamic interactions and visuals, at the same time principles of animation allows animators to create well-defined, unambiguous causal relationships. These factors allow models to solve more challenging, yet well-defined causal relationships. We also introduce hard negative mining, including CausalConfusion version. While models perform well, there is much room for improvement, especially, on open-ended answers. We identify more advanced/explicit causal relationship modeling and joint modeling of vision and language as the immediate areas for future efforts to focus upon. Along with the other complementary datasets, our new challenging dataset will pave the way for these developments in the field. We will release our dataset, codes, and models to help future efforts in this domain.
- [439] arXiv:2404.01300 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance FieldsMuhammad Zubair Irshad , Sergey Zakahrov , Vitor Guizilini , Adrien Gaidon , Zsolt Kira , Rares AmbrusComments: 29 pages, 13 figures. Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Neural fields excel in computer vision and robotics due to their ability to understand the 3D visual world such as inferring semantics, geometry, and dynamics. Given the capabilities of neural fields in densely representing a 3D scene from 2D images, we ask the question: Can we scale their self-supervised pretraining, specifically using masked autoencoders, to generate effective 3D representations from posed RGB images. Owing to the astounding success of extending transformers to novel data modalities, we employ standard 3D Vision Transformers to suit the unique formulation of NeRFs. We leverage NeRF's volumetric grid as a dense input to the transformer, contrasting it with other 3D representations such as pointclouds where the information density can be uneven, and the representation is irregular. Due to the difficulty of applying masked autoencoders to an implicit representation, such as NeRF, we opt for extracting an explicit representation that canonicalizes scenes across domains by employing the camera trajectory for sampling. Our goal is made possible by masking random patches from NeRF's radiance and density grid and employing a standard 3D Swin Transformer to reconstruct the masked patches. In doing so, the model can learn the semantic and spatial structure of complete scenes. We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.6 million images. Once pretrained, the encoder is used for effective 3D transfer learning. Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on Front3D and ScanNet datasets with an absolute performance improvement of over 20% AP50 and 8% AP25 for 3D object detection.
- [440] arXiv:2404.01317 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Intelligent Learning Rate Distribution to reduce Catastrophic Forgetting in TransformersSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Pretraining language models on large text corpora is a common practice in natural language processing. Fine-tuning of these models is then performed to achieve the best results on a variety of tasks. In this paper, we investigate the problem of catastrophic forgetting in transformer neural networks and question the common practice of fine-tuning with a flat learning rate for the entire network in this context. We perform a hyperparameter optimization process to find learning rate distributions that are better than a flat learning rate. We combine the learning rate distributions thus found and show that they generalize to better performance with respect to the problem of catastrophic forgetting. We validate these learning rate distributions with a variety of NLP benchmarks from the GLUE dataset.
- [441] arXiv:2404.01319 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: Information Cascade Prediction under Public Emergencies: A SurveySubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: With the advent of the era of big data, massive information, expert experience, and high-accuracy models bring great opportunities to the information cascade prediction of public emergencies. However, the involvement of specialist knowledge from various disciplines has resulted in a primarily application-specific focus (e.g., earthquakes, floods, infectious diseases) for information cascade prediction of public emergencies. The lack of a unified prediction framework poses a challenge for classifying intersectional prediction methods across different application fields. This survey paper offers a systematic classification and summary of information cascade modeling, prediction, and application. We aim to help researchers identify cutting-edge research and comprehend models and methods of information cascade prediction under public emergencies. By summarizing open issues and outlining future directions in this field, this paper has the potential to be a valuable resource for researchers conducting further studies on predicting information cascades.
- [442] arXiv:2404.01320 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: Graph-Based Optimisation of Network Expansion in a Dockless Bike Sharing SystemComments: Accepted to publish in The 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW&DASC-2024), pp. 1-8Subjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Bike-sharing systems (BSSs) are deployed in over a thousand cities worldwide and play an important role in many urban transportation systems. BSSs alleviate congestion, reduce pollution and promote physical exercise. It is essential to explore the spatiotemporal patterns of bike-sharing demand, as well as the factors that influence these patterns, in order to optimise system operational efficiency. In this study, an optimised geo-temporal graph is constructed using trip data from Moby Bikes, a dockless BSS operator. The process of optimising the graph unveiled prime locations for erecting new stations during future expansions of the BSS. The Louvain algorithm, a community detection technique, is employed to uncover usage patterns at different levels of temporal granularity. The community detection results reveal largely self-contained sub-networks that exhibit similar usage patterns at their respective levels of temporal granularity. Overall, this study reinforces that BSSs are intrinsically spatiotemporal systems, with community presence driven by spatiotemporal dynamics. These findings may aid operators in improving redistribution efficiency.
- [443] arXiv:2404.01322 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Review of Multi-Modal Large Language and Vision ModelsComments: 33 pages, 1 figureSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have recently emerged as a focal point of research and application, driven by their unprecedented ability to understand and generate text with human-like quality. Even more recently, LLMs have been extended into multi-modal large language models (MM-LLMs) which extends their capabilities to deal with image, video and audio information, in addition to text. This opens up applications like text-to-video generation, image captioning, text-to-speech, and more and is achieved either by retro-fitting an LLM with multi-modal capabilities, or building a MM-LLM from scratch. This paper provides an extensive review of the current state of those LLMs with multi-modal capabilities as well as the very recent MM-LLMs. It covers the historical development of LLMs especially the advances enabled by transformer-based architectures like OpenAI's GPT series and Google's BERT, as well as the role of attention mechanisms in enhancing model performance. The paper includes coverage of the major and most important of the LLMs and MM-LLMs and also covers the techniques of model tuning, including fine-tuning and prompt engineering, which tailor pre-trained models to specific tasks or domains. Ethical considerations and challenges, such as data bias and model misuse, are also analysed to underscore the importance of responsible AI development and deployment. Finally, we discuss the implications of open-source versus proprietary models in AI research. Through this review, we provide insights into the transformative potential of MM-LLMs in various applications.
- [444] arXiv:2404.01327 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Entertainment chatbot for the digital inclusion of elderly people without abstraction capabilitiesSilvia García-Méndez , Francisco de Arriba-Pérez , Francisco J. González-Castaño , José A. Regueiro-Janeiro , Felipe Gil-CastiñeiraSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: Current language processing technologies allow the creation of conversational chatbot platforms. Even though artificial intelligence is still too immature to support satisfactory user experience in many mass market domains, conversational interfaces have found their way into ad hoc applications such as call centres and online shopping assistants. However, they have not been applied so far to social inclusion of elderly people, who are particularly vulnerable to the digital divide. Many of them relieve their loneliness with traditional media such as TV and radio, which are known to create a feeling of companionship. In this paper we present the EBER chatbot, designed to reduce the digital gap for the elderly. EBER reads news in the background and adapts its responses to the user's mood. Its novelty lies in the concept of "intelligent radio", according to which, instead of simplifying a digital information system to make it accessible to the elderly, a traditional channel they find familiar -- background news -- is augmented with interactions via voice dialogues. We make it possible by combining Artificial Intelligence Modelling Language, automatic Natural Language Generation and Sentiment Analysis. The system allows accessing digital content of interest by combining words extracted from user answers to chatbot questions with keywords extracted from the news items. This approach permits defining metrics of the abstraction capabilities of the users depending on a spatial representation of the word space. To prove the suitability of the proposed solution we present results of real experiments conducted with elderly people that provided valuable insights. Our approach was considered satisfactory during the tests and improved the information search capabilities of the participants.
- [445] arXiv:2404.01331 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language ModelComments: Authors 1 and 2 contributed equally. Models available at this https URL and \url{ this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to improve past the current comparably sized SOTA models. Closer analysis of performance shows mixed effects; skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent effects. We publicly release training recipes, code and weights for our models for the LLaVA-Gemma models.
- [446] arXiv:2404.01332 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Wait, It's All Token Noise? Always Has Been: Interpreting LLM Behavior Using Shapley ValueSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The emergence of large language models (LLMs) has opened up exciting possibilities for simulating human behavior and cognitive processes, with potential applications in various domains, including marketing research and consumer behavior analysis. However, the validity of utilizing LLMs as stand-ins for human subjects remains uncertain due to glaring divergences that suggest fundamentally different underlying processes at play and the sensitivity of LLM responses to prompt variations. This paper presents a novel approach based on Shapley values from cooperative game theory to interpret LLM behavior and quantify the relative contribution of each prompt component to the model's output. Through two applications-a discrete choice experiment and an investigation of cognitive biases-we demonstrate how the Shapley value method can uncover what we term "token noise" effects, a phenomenon where LLM decisions are disproportionately influenced by tokens providing minimal informative content. This phenomenon raises concerns about the robustness and generalizability of insights obtained from LLMs in the context of human behavior simulation. Our model-agnostic approach extends its utility to proprietary LLMs, providing a valuable tool for marketers and researchers to strategically optimize prompts and mitigate apparent cognitive biases. Our findings underscore the need for a more nuanced understanding of the factors driving LLM responses before relying on them as substitutes for human subjects in research settings. We emphasize the importance of researchers reporting results conditioned on specific prompt templates and exercising caution when drawing parallels between human behavior and LLMs.
- [447] arXiv:2404.01335 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Generative AI for Architectural Design: A Literature ReviewComments: 32 pages, 20 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Generative Artificial Intelligence (AI) has pioneered new methodological paradigms in architectural design, significantly expanding the innovative potential and efficiency of the design process. This paper explores the extensive applications of generative AI technologies in architectural design, a trend that has benefited from the rapid development of deep generative models. This article provides a comprehensive review of the basic principles of generative AI and large-scale models and highlights the applications in the generation of 2D images, videos, and 3D models. In addition, by reviewing the latest literature from 2020, this paper scrutinizes the impact of generative AI technologies at different stages of architectural design, from generating initial architectural 3D forms to producing final architectural imagery. The marked trend of research growth indicates an increasing inclination within the architectural design community towards embracing generative AI, thereby catalyzing a shared enthusiasm for research. These research cases and methodologies have not only proven to enhance efficiency and innovation significantly but have also posed challenges to the conventional boundaries of architectural creativity. Finally, we point out new directions for design innovation and articulate fresh trajectories for applying generative AI in the architectural domain. This article provides the first comprehensive literature review about generative AI for architectural design, and we believe this work can facilitate more research work on this significant topic in architecture.
- [448] arXiv:2404.01336 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: FineFake: A Knowledge-Enriched Dataset for Fine-Grained Multi-Domain Fake News DetectionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: Existing benchmarks for fake news detection have significantly contributed to the advancement of models in assessing the authenticity of news content. However, these benchmarks typically focus solely on news pertaining to a single semantic topic or originating from a single platform, thereby failing to capture the diversity of multi-domain news in real scenarios. In order to understand fake news across various domains, the external knowledge and fine-grained annotations are indispensable to provide precise evidence and uncover the diverse underlying strategies for fabrication, which are also ignored by existing benchmarks. To address this gap, we introduce a novel multi-domain knowledge-enhanced benchmark with fine-grained annotations, named \textbf{FineFake}. FineFake encompasses 16,909 data samples spanning six semantic topics and eight platforms. Each news item is enriched with multi-modal content, potential social context, semi-manually verified common knowledge, and fine-grained annotations that surpass conventional binary labels. Furthermore, we formulate three challenging tasks based on FineFake and propose a knowledge-enhanced domain adaptation network. Extensive experiments are conducted on FineFake under various scenarios, providing accurate and reliable benchmarks for future endeavors. The entire FineFake project is publicly accessible as an open-source repository at \url{ this https URL }.
- [449] arXiv:2404.01339 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Humane Speech Synthesis through Zero-Shot Emotion and Disfluency GenerationComments: 10 pages, 1 figure, for associated code and media files, see this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Contemporary conversational systems often present a significant limitation: their responses lack the emotional depth and disfluent characteristic of human interactions. This absence becomes particularly noticeable when users seek more personalized and empathetic interactions. Consequently, this makes them seem mechanical and less relatable to human users. Recognizing this gap, we embarked on a journey to humanize machine communication, to ensure AI systems not only comprehend but also resonate. To address this shortcoming, we have designed an innovative speech synthesis pipeline. Within this framework, a cutting-edge language model introduces both human-like emotion and disfluencies in a zero-shot setting. These intricacies are seamlessly integrated into the generated text by the language model during text generation, allowing the system to mirror human speech patterns better, promoting more intuitive and natural user interactions. These generated elements are then adeptly transformed into corresponding speech patterns and emotive sounds using a rule-based approach during the text-to-speech phase. Based on our experiments, our novel system produces synthesized speech that's almost indistinguishable from genuine human communication, making each interaction feel more personal and authentic.
- [450] arXiv:2404.01340 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: From Similarity to Superiority: Channel Clustering for Time Series ForecastingJialin Chen , Jan Eric Lenssen , Aosong Feng , Weihua Hu , Matthias Fey , Leandros Tassiulas , Jure Leskovec , Rex YingComments: 20 pages, 6 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Time series forecasting has attracted significant attention in recent decades. Previous studies have demonstrated that the Channel-Independent (CI) strategy improves forecasting performance by treating different channels individually, while it leads to poor generalization on unseen instances and ignores potentially necessary interactions between channels. Conversely, the Channel-Dependent (CD) strategy mixes all channels with even irrelevant and indiscriminate information, which, however, results in oversmoothing issues and limits forecasting accuracy. There is a lack of channel strategy that effectively balances individual channel treatment for improved forecasting performance without overlooking essential interactions between channels. Motivated by our observation of a correlation between the time series model's performance boost against channel mixing and the intrinsic similarity on a pair of channels, we developed a novel and adaptable Channel Clustering Module (CCM). CCM dynamically groups channels characterized by intrinsic similarities and leverages cluster identity instead of channel identity, combining the best of CD and CI worlds. Extensive experiments on real-world datasets demonstrate that CCM can (1) boost the performance of CI and CD models by an average margin of 2.4% and 7.2% on long-term and short-term forecasting, respectively; (2) enable zero-shot forecasting with mainstream time series forecasting models; (3) uncover intrinsic time series patterns among channels and improve interpretability of complex time series models.
- [451] arXiv:2404.01341 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Block-Diagonal Guided DBSCAN ClusteringComments: arXiv admin note: text overlap with arXiv:2009.04552 by other authorsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Abstract: Cluster analysis plays a crucial role in database mining, and one of the most widely used algorithms in this field is DBSCAN. However, DBSCAN has several limitations, such as difficulty in handling high-dimensional large-scale data, sensitivity to input parameters, and lack of robustness in producing clustering results. This paper introduces an improved version of DBSCAN that leverages the block-diagonal property of the similarity graph to guide the clustering procedure of DBSCAN. The key idea is to construct a graph that measures the similarity between high-dimensional large-scale data points and has the potential to be transformed into a block-diagonal form through an unknown permutation, followed by a cluster-ordering procedure to generate the desired permutation. The clustering structure can be easily determined by identifying the diagonal blocks in the permuted graph. We propose a gradient descent-based method to solve the proposed problem. Additionally, we develop a DBSCAN-based points traversal algorithm that identifies clusters with high densities in the graph and generates an augmented ordering of clusters. The block-diagonal structure of the graph is then achieved through permutation based on the traversal order, providing a flexible foundation for both automatic and interactive cluster analysis. We introduce a split-and-refine algorithm to automatically search for all diagonal blocks in the permuted graph with theoretically optimal guarantees under specific cases. We extensively evaluate our proposed approach on twelve challenging real-world benchmark clustering datasets and demonstrate its superior performance compared to the state-of-the-art clustering method on every dataset.
- [452] arXiv:2404.01342 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language ModelComments: Published as a conference paper at CVPR 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research. For example, the Civitai community, a platform for T2I innovation, currently hosts an impressive array of 74,492 distinct models. However, this diversity presents a formidable challenge in selecting the most appropriate model and parameters, a process that typically requires numerous trials. Drawing inspiration from the tool usage research of large language models (LLMs), we introduce DiffAgent, an LLM agent designed to screen the accurate selection in seconds via API calls. DiffAgent leverages a novel two-stage training framework, SFTA, enabling it to accurately align T2I API responses with user input in accordance with human preferences. To train and evaluate DiffAgent's capabilities, we present DABench, a comprehensive dataset encompassing an extensive range of T2I APIs from the community. Our evaluations reveal that DiffAgent not only excels in identifying the appropriate T2I API but also underscores the effectiveness of the SFTA training framework. Codes are available at this https URL .
- [453] arXiv:2404.01343 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CHOPS: CHat with custOmer Profile Systems for Customer Service with LLMsComments: 14 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Businesses and software platforms are increasingly turning to Large Language Models (LLMs) such as GPT-3.5, GPT-4, GLM-3, and LLaMa-2 for chat assistance with file access or as reasoning agents for customer service. However, current LLM-based customer service models have limited integration with customer profiles and lack the operational capabilities necessary for effective service. Moreover, existing API integrations emphasize diversity over the precision and error avoidance essential in real-world customer service scenarios. To address these issues, we propose an LLM agent named CHOPS (CHat with custOmer Profile in existing System), designed to: (1) efficiently utilize existing databases or systems for accessing user information or interacting with these systems following existing guidelines; (2) provide accurate and reasonable responses or carry out required operations in the system while avoiding harmful operations; and (3) leverage a combination of small and large LLMs to achieve satisfying performance at a reasonable inference cost. We introduce a practical dataset, the CPHOS-dataset, which includes a database, guiding files, and QA pairs collected from CPHOS, an online platform that facilitates the organization of simulated Physics Olympiads for high school teachers and students. We have conducted extensive experiments to validate the performance of our proposed CHOPS architecture using the CPHOS-dataset, with the aim of demonstrating how LLMs can enhance or serve as alternatives to human customer service. Code for our proposed architecture and dataset can be found at { this https URL }.
- [454] arXiv:2404.01349 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Fairness in Large Language Models: A Taxonomic SurveySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have demonstrated remarkable success across various domains. However, despite their promising performance in numerous real-world applications, most of these algorithms lack fairness considerations. Consequently, they may lead to discriminatory outcomes against certain communities, particularly marginalized populations, prompting extensive study in fair LLMs. On the other hand, fairness in LLMs, in contrast to fairness in traditional machine learning, entails exclusive backgrounds, taxonomies, and fulfillment techniques. To this end, this survey presents a comprehensive overview of recent advances in the existing literature concerning fair LLMs. Specifically, a brief introduction to LLMs is provided, followed by an analysis of factors contributing to bias in LLMs. Additionally, the concept of fairness in LLMs is discussed categorically, summarizing metrics for evaluating bias in LLMs and existing algorithms for promoting fairness. Furthermore, resources for evaluating bias in LLMs, including toolkits and datasets, are summarized. Finally, existing research challenges and open questions are discussed.
- [455] arXiv:2404.01351 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: AETTA: Label-Free Accuracy Estimation for Test-Time AdaptationComments: Accepted to CVPR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Test-time adaptation (TTA) has emerged as a viable solution to adapt pre-trained models to domain shifts using unlabeled test data. However, TTA faces challenges of adaptation failures due to its reliance on blind adaptation to unknown test samples in dynamic scenarios. Traditional methods for out-of-distribution performance estimation are limited by unrealistic assumptions in the TTA context, such as requiring labeled data or re-training models. To address this issue, we propose AETTA, a label-free accuracy estimation algorithm for TTA. We propose the prediction disagreement as the accuracy estimate, calculated by comparing the target model prediction with dropout inferences. We then improve the prediction disagreement to extend the applicability of AETTA under adaptation failures. Our extensive evaluation with four baselines and six TTA methods demonstrates that AETTA shows an average of 19.8%p more accurate estimation compared with the baselines. We further demonstrate the effectiveness of accuracy estimation with a model recovery case study, showcasing the practicality of our model recovery based on accuracy estimation. The source code is available at this https URL .
- [456] arXiv:2404.01352 (cross-list from physics.flu-dyn) [ pdf , ps , html , other ]
-
Title: VortexViz: Finding Vortex Boundaries by Learning from Particle TrajectoriesAkila de Silva , Nicholas Tee , Omkar Ghanekar , Fahim Hasan Khan , Gregory Dusek , James Davis , Alex PangComments: Under reviewSubjects: Fluid Dynamics (physics.flu-dyn) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Abstract: Vortices are studied in various scientific disciplines, offering insights into fluid flow behavior. Visualizing the boundary of vortices is crucial for understanding flow phenomena and detecting flow irregularities. This paper addresses the challenge of accurately extracting vortex boundaries using deep learning techniques. While existing methods primarily train on velocity components, we propose a novel approach incorporating particle trajectories (streamlines or pathlines) into the learning process. By leveraging the regional/local characteristics of the flow field captured by streamlines or pathlines, our methodology aims to enhance the accuracy of vortex boundary extraction.
- [457] arXiv:2404.01353 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Efficiently Distilling LLMs for Edge ApplicationsComments: This paper has been accepted for publication in NAACL 2024 (Industry Track)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Supernet training of LLMs is of great interest in industrial applications as it confers the ability to produce a palette of smaller models at constant cost, regardless of the number of models (of different size / latency) produced. We propose a new method called Multistage Low-rank Fine-tuning of Super-transformers (MLFS) for parameter-efficient supernet training. We show that it is possible to obtain high-quality encoder models that are suitable for commercial edge applications, and that while decoder-only models are resistant to a comparable degree of compression, decoders can be effectively sliced for a significant reduction in training time.
- [458] arXiv:2404.01356 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: The Double-Edged Sword of Input Perturbations to Robust Accurate FairnessSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Deep neural networks (DNNs) are known to be sensitive to adversarial input perturbations, leading to a reduction in either prediction accuracy or individual fairness. To jointly characterize the susceptibility of prediction accuracy and individual fairness to adversarial perturbations, we introduce a novel robustness definition termed robust accurate fairness. Informally, robust accurate fairness requires that predictions for an instance and its similar counterparts consistently align with the ground truth when subjected to input perturbations. We propose an adversarial attack approach dubbed RAFair to expose false or biased adversarial defects in DNN, which either deceive accuracy or compromise individual fairness. Then, we show that such adversarial instances can be effectively addressed by carefully designed benign perturbations, correcting their predictions to be accurate and fair. Our work explores the double-edged sword of input perturbations to robust accurate fairness in DNN and the potential of using benign perturbations to correct adversarial instances.
- [459] arXiv:2404.01358 (cross-list from q-bio.QM) [ pdf , ps , html , other ]
-
Title: Utilizing AI and Social Media Analytics to Discover Adverse Side Effects of GLP-1 Receptor AgonistsComments: 19 pages, 7 figures, 3 tables, 1 Appendix tableSubjects: Quantitative Methods (q-bio.QM) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract: Adverse side effects (ASEs) of drugs, revealed after FDA approval, pose a threat to patient safety. To promptly detect overlooked ASEs, we developed a digital health methodology capable of analyzing massive public data from social media, published clinical research, manufacturers' reports, and ChatGPT. We uncovered ASEs associated with the glucagon-like peptide 1 receptor agonists (GLP-1 RA), a market expected to grow exponentially to $133.5 billion USD by 2030. Using a Named Entity Recognition (NER) model, our method successfully detected 21 potential ASEs overlooked upon FDA approval, including irritability and numbness. Our data-analytic approach revolutionizes the detection of unreported ASEs associated with newly deployed drugs, leveraging cutting-edge AI-driven social media analytics. It can increase the safety of new drugs in the marketplace by unlocking the power of social media to support regulators and manufacturers in the rapid discovery of hidden ASE risks.
- [460] arXiv:2404.01359 (cross-list from quant-ph) [ pdf , ps , other ]
-
Title: Parallel Proportional Fusion of Spiking Quantum Neural Network for Optimizing Image ClassificationZuyu Xu , Kang Shen , Pengnian Cai , Tao Yang , Yuanming Hu , Shixian Chen , Yunlai Zhu , Zuheng Wu , Yuehua Dai , Jun Wang , Fei YangSubjects: Quantum Physics (quant-ph) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: The recent emergence of the hybrid quantum-classical neural network (HQCNN) architecture has garnered considerable attention due to the potential advantages associated with integrating quantum principles to enhance various facets of machine learning algorithms and computations. However, the current investigated serial structure of HQCNN, wherein information sequentially passes from one network to another, often imposes limitations on the trainability and expressivity of the network. In this study, we introduce a novel architecture termed Parallel Proportional Fusion of Quantum and Spiking Neural Networks (PPF-QSNN). The dataset information is simultaneously fed into both the spiking neural network and the variational quantum circuits, with the outputs amalgamated in proportion to their individual contributions. We systematically assess the impact of diverse PPF-QSNN parameters on network performance for image classification, aiming to identify the optimal configuration. Numerical results on the MNIST dataset unequivocally illustrate that our proposed PPF-QSNN outperforms both the existing spiking neural network and the serial quantum neural network across metrics such as accuracy, loss, and robustness. This study introduces a novel and effective amalgamation approach for HQCNN, thereby laying the groundwork for the advancement and application of quantum advantage in artificial intelligent computations.
- [461] arXiv:2404.01361 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: LLM Attributor: Interactive Visual Attribution for LLM GenerationSeongmin Lee , Zijie J. Wang , Aishwarya Chakravarthy , Alec Helbling , ShengYun Peng , Mansi Phute , Duen Horng Chau , Minsuk KahngComments: 8 pages, 3 figures, For a video demo, see this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: While large language models (LLMs) have shown remarkable capability to generate convincing text across diverse domains, concerns around its potential risks have highlighted the importance of understanding the rationale behind text generation. We present LLM Attributor, a Python library that provides interactive visualizations for training data attribution of an LLM's text generation. Our library offers a new way to quickly attribute an LLM's text generation to training data points to inspect model behaviors, enhance its trustworthiness, and compare model-generated text with user-provided text. We describe the visual and interactive design of our tool and highlight usage scenarios for LLaMA2 models fine-tuned with two different datasets: online articles about recent disasters and finance-related question-answer pairs. Thanks to LLM Attributor's broad support for computational notebooks, users can easily integrate it into their workflow to interactively visualize attributions of their models. For easier access and extensibility, we open-source LLM Attributor at this https URL LLM-Attribution. The video demo is available at this https URL .
- [462] arXiv:2404.01363 (cross-list from cs.OS) [ pdf , ps , html , other ]
-
Title: AIOps Solutions for Incident Management: Technical Guidelines and A Comprehensive Literature ReviewSubjects: Operating Systems (cs.OS) ; Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Abstract: The management of modern IT systems poses unique challenges, necessitating scalability, reliability, and efficiency in handling extensive data streams. Traditional methods, reliant on manual tasks and rule-based approaches, prove inefficient for the substantial data volumes and alerts generated by IT systems. Artificial Intelligence for Operating Systems (AIOps) has emerged as a solution, leveraging advanced analytics like machine learning and big data to enhance incident management. AIOps detects and predicts incidents, identifies root causes, and automates healing actions, improving quality and reducing operational costs. However, despite its potential, the AIOps domain is still in its early stages, decentralized across multiple sectors, and lacking standardized conventions. Research and industrial contributions are distributed without consistent frameworks for data management, target problems, implementation details, requirements, and capabilities. This study proposes an AIOps terminology and taxonomy, establishing a structured incident management procedure and providing guidelines for constructing an AIOps framework. The research also categorizes contributions based on criteria such as incident management tasks, application areas, data sources, and technical approaches. The goal is to provide a comprehensive review of technical and research aspects in AIOps for incident management, aiming to structure knowledge, identify gaps, and establish a foundation for future developments in the field.
- [463] arXiv:2404.01364 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Information Plane Analysis Visualization in Deep Learning via Transfer EntropyJournal-ref: 2023 27th International Conference Information Visualisation (IV), pages 278-285Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Theory (cs.IT)
Abstract: In a feedforward network, Transfer Entropy (TE) can be used to measure the influence that one layer has on another by quantifying the information transfer between them during training. According to the Information Bottleneck principle, a neural model's internal representation should compress the input data as much as possible while still retaining sufficient information about the output. Information Plane analysis is a visualization technique used to understand the trade-off between compression and information preservation in the context of the Information Bottleneck method by plotting the amount of information in the input data against the compressed representation. The claim that there is a causal link between information-theoretic compression and generalization, measured by mutual information, is plausible, but results from different studies are conflicting. In contrast to mutual information, TE can capture temporal relationships between variables. To explore such links, in our novel approach we use TE to quantify information transfer between neural layers and perform Information Plane analysis. We obtained encouraging experimental results, opening the possibility for further investigations.
- [464] arXiv:2404.01365 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Prompt-prompted Mixture of Experts for Efficient LLM GenerationComments: Revision 1: Updated abstract with code link; re-ran top-k + sampling rows in Table 4, conclusions unchangedSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: With the development of transformer-based large language models (LLMs), they have been applied to many fields due to their remarkable utility, but this comes at a considerable computational cost at deployment. Fortunately, some methods such as pruning or constructing a mixture of experts (MoE) aim at exploiting sparsity in transformer feedforward (FF) blocks to gain boosts in speed and reduction in memory requirements. However, these techniques can be very costly and inflexible in practice, as they often require training or are restricted to specific types of architectures. To address this, we introduce GRIFFIN, a novel training-free MoE that selects unique FF experts at the sequence level for efficient generation across a plethora of LLMs with different non-ReLU activation functions. This is possible due to a critical observation that many trained LLMs naturally produce highly structured FF activation patterns within a sequence, which we call flocking. Despite our method's simplicity, we show with 50% of the FF parameters, GRIFFIN maintains the original model's performance with little to no degradation on a variety of classification and generation tasks, all while improving latency (e.g. 1.25$\times$ speed-up in Llama 2 13B on an NVIDIA L40). Code is available at this https URL .
- [465] arXiv:2404.01397 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Object-conditioned Bag of Instances for Few-Shot Personalized Instance RecognitionComments: ICASSP 2024. Copyright 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in otherSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Nowadays, users demand for increased personalization of vision systems to localize and identify personal instances of objects (e.g., my dog rather than dog) from a few-shot dataset only. Despite outstanding results of deep networks on classical label-abundant benchmarks (e.g., those of the latest YOLOv8 model for standard object detection), they struggle to maintain within-class variability to represent different instances rather than object categories only. We construct an Object-conditioned Bag of Instances (OBoI) based on multi-order statistics of extracted features, where generic object detection models are extended to search and identify personal instances from the OBoI's metric space, without need for backpropagation. By relying on multi-order statistics, OBoI achieves consistent superior accuracy in distinguishing different instances. In the results, we achieve 77.1% personal object recognition accuracy in case of 18 personal instances, showing about 12% relative gain over the state of the art.
- [466] arXiv:2404.01402 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: ContactHandover: Contact-Guided Robot-to-Human Object HandoverComments: Project website: this https URLSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Robot-to-human object handover is an important step in many human robot collaboration tasks. A successful handover requires the robot to maintain a stable grasp on the object while making sure the human receives the object in a natural and easy-to-use manner. We propose ContactHandover, a robot to human handover system that consists of two phases: a contact-guided grasping phase and an object delivery phase. During the grasping phase, ContactHandover predicts both 6-DoF robot grasp poses and a 3D affordance map of human contact points on the object. The robot grasp poses are reranked by penalizing those that block human contact points, and the robot executes the highest ranking grasp. During the delivery phase, the robot end effector pose is computed by maximizing human contact points close to the human while minimizing the human arm joint torques and displacements. We evaluate our system on 27 diverse household objects and show that our system achieves better visibility and reachability of human contacts to the receiver compared to several baselines. More results can be found on this https URL
- [467] arXiv:2404.01409 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual RepresentationComments: CVPR 2024; 12 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: In the realm of food computing, segmenting ingredients from images poses substantial challenges due to the large intra-class variance among the same ingredients, the emergence of new ingredients, and the high annotation costs associated with large food segmentation datasets. Existing approaches primarily utilize a closed-vocabulary and static text embeddings setting. These methods often fall short in effectively handling the ingredients, particularly new and diverse ones. In response to these limitations, we introduce OVFoodSeg, a framework that adopts an open-vocabulary setting and enhances text embeddings with visual context. By integrating vision-language models (VLMs), our approach enriches text embedding with image-specific information through two innovative modules, eg, an image-to-text learner FoodLearner and an Image-Informed Text Encoder. The training process of OVFoodSeg is divided into two stages: the pre-training of FoodLearner and the subsequent learning phase for segmentation. The pre-training phase equips FoodLearner with the capability to align visual information with corresponding textual representations that are specifically related to food, while the second phase adapts both the FoodLearner and the Image-Informed Text Encoder for the segmentation task. By addressing the deficiencies of previous models, OVFoodSeg demonstrates a significant improvement, achieving an 4.9\% increase in mean Intersection over Union (mIoU) on the FoodSeg103 dataset, setting a new milestone for food image segmentation.
- [468] arXiv:2404.01413 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic DataMatthias Gerstgrasser , Rylan Schaeffer , Apratim Dey , Rafael Rafailov , Henry Sleight , John Hughes , Tomasz Korbak , Rajashree Agrawal , Dhruv Pai , Andrey Gromov , Daniel A. Roberts , Diyi Yang , David L. Donoho , Sanmi KoyejoSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (stat.ML)
Abstract: The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration until fitted models become useless. However, those studies largely assumed that new data replace old data over time, where an arguably more realistic assumption is that data accumulate over time. In this paper, we ask: what effect does accumulating data have on model collapse? We empirically study this question by pretraining sequences of language models on text corpora. We confirm that replacing the original real data by each generation's synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse; these results hold across a range of model sizes, architectures, and hyperparameters. We obtain similar results for deep generative models on other types of real data: diffusion models for molecule conformation generation and variational autoencoders for image generation. To understand why accumulating data can avoid model collapse, we use an analytically tractable framework introduced by prior work in which a sequence of linear models are fit to the previous models' outputs. Previous work used this framework to show that if data are replaced, the test error increases with the number of model-fitting iterations; we extend this argument to prove that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations, meaning model collapse no longer occurs.
- [469] arXiv:2404.01430 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Position-Aware Parameter Efficient Fine-Tuning Approach for Reducing Positional Bias in LLMsZheng Zhang , Fan Yang , Ziyan Jiang , Zheng Chen , Zhengyang Zhao , Chengyuan Ma , Liang Zhao , Yang LiuSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent advances in large language models (LLMs) have enhanced their ability to process long input contexts. This development is particularly crucial for tasks that involve retrieving knowledge from an external datastore, which can result in long inputs. However, recent studies show a positional bias in LLMs, demonstrating varying performance depending on the location of useful information within the input sequence. In this study, we conduct extensive experiments to investigate the root causes of positional bias. Our findings indicate that the primary contributor to LLM positional bias stems from the inherent positional preferences of different models. We demonstrate that merely employing prompt-based solutions is inadequate for overcoming the positional preferences. To address this positional bias issue of a pre-trained LLM, we developed a Position-Aware Parameter Efficient Fine-Tuning (PAPEFT) approach which is composed of a data augmentation technique and a parameter efficient adapter, enhancing a uniform attention distribution across the input context. Our experiments demonstrate that the proposed approach effectively reduces positional bias, improving LLMs' effectiveness in handling long context sequences for various tasks that require externally retrieved knowledge.
- [470] arXiv:2404.01438 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Generation and Detection of Sign Language Deepfakes -- A Linguistic and Visual AnalysisShahzeb Naeem , Muhammad Riyyan Khan , Usman Tariq , Abhinav Dhall , Carlos Ivan Colon , Hasan Al-NashashComments: 13 pages, 13 figures, Computer Vision and Image Understanding JournalSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: A question in the realm of deepfakes is slowly emerging pertaining to whether we can go beyond facial deepfakes and whether it would be beneficial to society. Therefore, this research presents a positive application of deepfake technology in upper body generation, while performing sign-language for the Deaf and Hard of Hearing (DHoH) community. The resulting videos are later vetted with a sign language expert. This is particularly helpful, given the intricate nature of sign language, a scarcity of sign language experts, and potential benefits for health and education. The objectives of this work encompass constructing a reliable deepfake dataset, evaluating its technical and visual credibility through computer vision and natural language processing models, and assessing the plausibility of the generated content. With over 1200 videos, featuring both previously seen and unseen individuals for the generation model, using the help of a sign language expert, we establish a deepfake dataset in sign language that can further be utilized to detect fake videos that may target certain people of determination.
- [471] arXiv:2404.01439 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Creating emoji lexica from unsupervised sentiment analysis of their descriptionsMilagros Fernández-Gavilanes , Jonathan Juncal-Martínez , Silvia García-Méndez , Enrique Costa-Montenegro , Francisco Javier González-CastañoSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Online media, such as blogs and social networking sites, generate massive volumes of unstructured data of great interest to analyze the opinions and sentiments of individuals and organizations. Novel approaches beyond Natural Language Processing are necessary to quantify these opinions with polarity metrics. So far, the sentiment expressed by emojis has received little attention. The use of symbols, however, has boomed in the past four years. About twenty billion are typed in Twitter nowadays, and new emojis keep appearing in each new Unicode version, making them increasingly relevant to sentiment analysis tasks. This has motivated us to propose a novel approach to predict the sentiments expressed by emojis in online textual messages, such as tweets, that does not require human effort to manually annotate data and saves valuable time for other analysis tasks. For this purpose, we automatically constructed a novel emoji sentiment lexicon using an unsupervised sentiment analysis system based on the definitions given by emoji creators in Emojipedia. Additionally, we automatically created lexicon variants by also considering the sentiment distribution of the informal texts accompanying emojis. All these lexica are evaluated and compared regarding the improvement obtained by including them in sentiment analysis of the annotated datasets provided by Kralj Novak et al. (2015). The results confirm the competitiveness of our approach.
- [472] arXiv:2404.01440 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Neural Implicit Representation for Building Digital Twins of Unknown Articulated ObjectsYijia Weng , Bowen Wen , Jonathan Tremblay , Valts Blukis , Dieter Fox , Leonidas Guibas , Stan BirchfieldComments: CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)
Abstract: We address the problem of building digital twins of unknown articulated objects from two RGBD scans of the object at different articulation states. We decompose the problem into two stages, each addressing distinct aspects. Our method first reconstructs object-level shape at each state, then recovers the underlying articulation model including part segmentation and joint articulations that associate the two states. By explicitly modeling point-level correspondences and exploiting cues from images, 3D reconstructions, and kinematics, our method yields more accurate and stable results compared to prior work. It also handles more than one movable part and does not rely on any object shape or structure priors. Project page: this https URL
- [473] arXiv:2404.01446 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Finding Regions of Interest in Whole Slide Images Using Multiple Instance LearningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Whole Slide Images (WSI), obtained by high-resolution digital scanning of microscope slides at multiple scales, are the cornerstone of modern Digital Pathology. However, they represent a particular challenge to AI-based/AI-mediated analysis because pathology labeling is typically done at slide-level, instead of tile-level. It is not just that medical diagnostics is recorded at the specimen level, the detection of oncogene mutation is also experimentally obtained, and recorded by initiatives like The Cancer Genome Atlas (TCGA), at the slide level. This configures a dual challenge: a) accurately predicting the overall cancer phenotype and b) finding out what cellular morphologies are associated with it at the tile level. To address these challenges, a weakly supervised Multiple Instance Learning (MIL) approach was explored for two prevalent cancer types, Invasive Breast Carcinoma (TCGA-BRCA) and Lung Squamous Cell Carcinoma (TCGA-LUSC). This approach was explored for tumor detection at low magnification levels and TP53 mutations at various levels. Our results show that a novel additive implementation of MIL matched the performance of reference implementation (AUC 0.96), and was only slightly outperformed by Attention MIL (AUC 0.97). More interestingly from the perspective of the molecular pathologist, these different AI architectures identify distinct sensitivities to morphological features (through the detection of Regions of Interest, RoI) at different amplification levels. Tellingly, TP53 mutation was most sensitive to features at the higher applications where cellular morphology is resolved.
- [474] arXiv:2404.01453 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Unveiling Divergent Inductive Biases of LLMs on Temporal DataSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Unraveling the intricate details of events in natural language necessitates a subtle understanding of temporal dynamics. Despite the adeptness of Large Language Models (LLMs) in discerning patterns and relationships from data, their inherent comprehension of temporal dynamics remains a formidable challenge. This research meticulously explores these intrinsic challenges within LLMs, with a specific emphasis on evaluating the performance of GPT-3.5 and GPT-4 models in the analysis of temporal data. Employing two distinct prompt types, namely Question Answering (QA) format and Textual Entailment (TE) format, our analysis probes into both implicit and explicit events. The findings underscore noteworthy trends, revealing disparities in the performance of GPT-3.5 and GPT-4. Notably, biases toward specific temporal relationships come to light, with GPT-3.5 demonstrating a preference for "AFTER'' in the QA format for both implicit and explicit events, while GPT-4 leans towards "BEFORE''. Furthermore, a consistent pattern surfaces wherein GPT-3.5 tends towards "TRUE'', and GPT-4 exhibits a preference for "FALSE'' in the TE format for both implicit and explicit events. This persistent discrepancy between GPT-3.5 and GPT-4 in handling temporal data highlights the intricate nature of inductive bias in LLMs, suggesting that the evolution of these models may not merely mitigate bias but may introduce new layers of complexity.
- [475] arXiv:2404.01459 (cross-list from cs.DC) [ pdf , ps , other ]
-
Title: Game-Theoretic Deep Reinforcement Learning to Minimize Carbon Emissions and Energy Costs for AI Inference Workloads in Geo-Distributed Data CentersComments: arXiv admin note: text overlap with arXiv:2106.00066Subjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Data centers are increasingly using more energy due to the rise in Artificial Intelligence (AI) workloads, which negatively impacts the environment and raises operational costs. Reducing operating expenses and carbon emissions while maintaining performance in data centers is a challenging problem. This work introduces a unique approach combining Game Theory (GT) and Deep Reinforcement Learning (DRL) for optimizing the distribution of AI inference workloads in geo-distributed data centers to reduce carbon emissions and cloud operating (energy + data transfer) costs. The proposed technique integrates the principles of non-cooperative Game Theory into a DRL framework, enabling data centers to make intelligent decisions regarding workload allocation while considering the heterogeneity of hardware resources, the dynamic nature of electricity prices, inter-data center data transfer costs, and carbon footprints. We conducted extensive experiments comparing our game-theoretic DRL (GT-DRL) approach with current DRL-based and other optimization techniques. The results demonstrate that our strategy outperforms the state-of-the-art in reducing carbon emissions and minimizing cloud operating costs without compromising computational performance. This work has significant implications for achieving sustainability and cost-efficiency in data centers handling AI inference workloads across diverse geographic locations.
- [476] arXiv:2404.01464 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical ImagesComments: CVPR 2024Subjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: 4D medical images, which represent 3D images with temporal information, are crucial in clinical practice for capturing dynamic changes and monitoring long-term disease progression. However, acquiring 4D medical images poses challenges due to factors such as radiation exposure and imaging duration, necessitating a balance between achieving high temporal resolution and minimizing adverse effects. Given these circumstances, not only is data acquisition challenging, but increasing the frame rate for each dataset also proves difficult. To address this challenge, this paper proposes a simple yet effective Unsupervised Volumetric Interpolation framework, UVI-Net. This framework facilitates temporal interpolation without the need for any intermediate frames, distinguishing it from the majority of other existing unsupervised methods. Experiments on benchmark datasets demonstrate significant improvements across diverse evaluation metrics compared to unsupervised and supervised baselines. Remarkably, our approach achieves this superior performance even when trained with a dataset as small as one, highlighting its exceptional robustness and efficiency in scenarios with sparse supervision. This positions UVI-Net as a compelling alternative for 4D medical imaging, particularly in settings where data availability is limited. The source code is available at this https URL .
- [477] arXiv:2404.01475 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Are large language models superhuman chemists?Adrian Mirza , Nawaf Alampara , Sreekanth Kunchapu , Benedict Emoekabu , Aswanth Krishnan , Mara Wilhelmi , Macjonathan Okereke , Juliane Eberhardt , Amir Mohammad Elahi , Maximilian Greiner , Caroline T. Holick , Tanya Gupta , Mehrdad Asgari , Christina Glaubitz , Lea C. Klepsch , Yannik Köster , Jakob Meyer , Santiago Miret , Tim Hoffmann , Fabian Alexander Kreth , Michael Ringleb , Nicole Roesner , Ulrich S. Schubert , Leanne M. Stafast , Dinga Wonanke , Michael Pieler , Philippe Schwaller , Kevin Maik JablonkaSubjects: Machine Learning (cs.LG) ; Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Abstract: Large language models (LLMs) have gained widespread interest due to their ability to process human language and perform tasks on which they have not been explicitly trained. This is relevant for the chemical sciences, which face the problem of small and diverse datasets that are frequently in the form of text. LLMs have shown promise in addressing these issues and are increasingly being harnessed to predict chemical properties, optimize reactions, and even design and conduct experiments autonomously. However, we still have only a very limited systematic understanding of the chemical reasoning capabilities of LLMs, which would be required to improve models and mitigate potential harms. Here, we introduce "ChemBench," an automated framework designed to rigorously evaluate the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of human chemists. We curated more than 7,000 question-answer pairs for a wide array of subfields of the chemical sciences, evaluated leading open and closed-source LLMs, and found that the best models outperformed the best human chemists in our study on average. The models, however, struggle with some chemical reasoning tasks that are easy for human experts and provide overconfident, misleading predictions, such as about chemicals' safety profiles. These findings underscore the dual reality that, although LLMs demonstrate remarkable proficiency in chemical tasks, further research is critical to enhancing their safety and utility in chemical sciences. Our findings also indicate a need for adaptations to chemistry curricula and highlight the importance of continuing to develop evaluation frameworks to improve safe and useful LLMs.
- [478] arXiv:2404.01476 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: TraveLER: A Multi-LMM Agent Framework for Video Question-AnsweringSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Recently, Large Multimodal Models (LMMs) have made significant progress in video question-answering using a frame-wise approach by leveraging large-scale, image-based pretraining in a zero-shot manner. While image-based methods for videos have shown impressive performance, a current limitation is that they often overlook how key timestamps are selected and cannot adjust when incorrect timestamps are identified. Moreover, they are unable to extract details relevant to the question, instead providing general descriptions of the frame. To overcome this, we design a multi-LMM agent framework that travels along the video, iteratively collecting relevant information from keyframes through interactive question-asking until there is sufficient information to answer the question. Specifically, we propose TraveLER, a model that can create a plan to "Traverse" through the video, ask questions about individual frames to "Locate" and store key information, and then "Evaluate" if there is enough information to answer the question. Finally, if there is not enough information, our method is able to "Replan" based on its collected knowledge. Through extensive experiments, we find that the proposed TraveLER approach improves performance on several video question-answering benchmarks, such as NExT-QA, STAR, and Perception Test, without the need to fine-tune on specific datasets.
- [479] arXiv:2404.01486 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: QuAD: Query-based Interpretable Neural Motion Planning for Autonomous DrivingSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: A self-driving vehicle must understand its environment to determine the appropriate action. Traditional autonomy systems rely on object detection to find the agents in the scene. However, object detection assumes a discrete set of objects and loses information about uncertainty, so any errors compound when predicting the future behavior of those agents. Alternatively, dense occupancy grid maps have been utilized to understand free-space. However, predicting a grid for the entire scene is wasteful since only certain spatio-temporal regions are reachable and relevant to the self-driving vehicle. We present a unified, interpretable, and efficient autonomy framework that moves away from cascading modules that first perceive, then predict, and finally plan. Instead, we shift the paradigm to have the planner query occupancy at relevant spatio-temporal points, restricting the computation to those regions of interest. Exploiting this representation, we evaluate candidate trajectories around key factors such as collision avoidance, comfort, and progress for safety and interpretability. Our approach achieves better highway driving quality than the state-of-the-art in high-fidelity closed-loop simulations.
- [480] arXiv:2404.01492 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Modality Translation for Object Detection Adaptation Without Forgetting Prior KnowledgeHeitor Rapela Medeiros , Masih Aminbeidokhti , Fidel Guerrero Pena , David Latortue , Eric Granger , Marco PedersoliSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: A common practice in deep learning consists of training large neural networks on massive datasets to perform accurately for different domains and tasks. While this methodology may work well in numerous application areas, it only applies across modalities due to a larger distribution shift in data captured using different sensors. This paper focuses on the problem of adapting a large object detection model to one or multiple modalities while being efficient. To do so, we propose ModTr as an alternative to the common approach of fine-tuning large models. ModTr consists of adapting the input with a small transformation network trained to minimize the detection loss directly. The original model can therefore work on the translated inputs without any further change or fine-tuning to its parameters. Experimental results on translating from IR to RGB images on two well-known datasets show that this simple ModTr approach provides detectors that can perform comparably or better than the standard fine-tuning without forgetting the original knowledge. This opens the doors to a more flexible and efficient service-based detection pipeline in which, instead of using a different detector for each modality, a unique and unaltered server is constantly running, where multiple modalities with the corresponding translations can query it. Code: this https URL .
- [481] arXiv:2404.01509 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Can Biases in ImageNet Models Explain Generalization?Comments: Accepted at CVPR2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract: The robust generalization of models to rare, in-distribution (ID) samples drawn from the long tail of the training distribution and to out-of-training-distribution (OOD) samples is one of the major challenges of current deep learning methods. For image classification, this manifests in the existence of adversarial attacks, the performance drops on distorted images, and a lack of generalization to concepts such as sketches. The current understanding of generalization in neural networks is very limited, but some biases that differentiate models from human vision have been identified and might be causing these limitations. Consequently, several attempts with varying success have been made to reduce these biases during training to improve generalization. We take a step back and sanity-check these attempts. Fixing the architecture to the well-established ResNet-50, we perform a large-scale study on 48 ImageNet models obtained via different training methods to understand how and if these biases - including shape bias, spectral biases, and critical bands - interact with generalization. Our extensive study results reveal that contrary to previous findings, these biases are insufficient to accurately predict the generalization of a model holistically. We provide access to all checkpoints and evaluation code at this https URL
- [482] arXiv:2404.01524 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: On Train-Test Class Overlap and Detection for Image RetrievalComments: CVPR2024 AcceptedSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: How important is it for training and evaluation sets to not have class overlap in image retrieval? We revisit Google Landmarks v2 clean, the most popular training set, by identifying and removing class overlap with Revisited Oxford and Paris [34], the most popular evaluation set. By comparing the original and the new RGLDv2-clean on a benchmark of reproduced state-of-the-art methods, our findings are striking. Not only is there a dramatic drop in performance, but it is inconsistent across methods, changing the ranking.What does it take to focus on objects or interest and ignore background clutter when indexing? Do we need to train an object detector and the representation separately? Do we need location supervision? We introduce Single-stage Detect-to-Retrieve (CiDeR), an end-to-end, single-stage pipeline to detect objects of interest and extract a global image representation. We outperform previous state-of-the-art on both existing training sets and the new RGLDv2-clean. Our dataset is available at this https URL .
- [483] arXiv:2404.01536 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Laying Anchors: Semantically Priming Numerals in Language ModelingComments: Accepted to the findings of NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Off-the-shelf pre-trained language models have become the de facto standard in NLP pipelines for a multitude of downstream tasks. However, the inability of these models to properly encode numerals limits their performance on tasks requiring numeric comprehension. We introduce strategies to semantically prime numerals in any corpus by generating anchors governed by the distribution of numerals in said corpus, thereby enabling mathematically grounded representations of these numeral tokens. We establish the superiority of our proposed techniques through evaluation on a range of numeracy tasks for both in-domain (seen) and out-domain (unseen) numerals. Further, we expand our empirical evaluations to numerals ranging from 1 to 10 billion, a significantly broader range compared to previous studies of the same nature, and we demonstrate significant improvements in the mathematical grounding of our learned embeddings.
- [484] arXiv:2404.01548 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and ReasoningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In the fields of computer vision and natural language processing, multimodal chart question-answering, especially involving color, structure, and textless charts, poses significant challenges. Traditional methods, which typically involve either direct multimodal processing or a table-to-text conversion followed by language model analysis, have limitations in effectively handling these complex scenarios. This paper introduces a novel multimodal chart question-answering model, specifically designed to address these intricate tasks. Our model integrates visual and linguistic processing, overcoming the constraints of existing methods. We adopt a dual-phase training approach: the initial phase focuses on aligning image and text representations, while the subsequent phase concentrates on optimizing the model's interpretative and analytical abilities in chart-related queries. This approach has demonstrated superior performance on multiple public datasets, particularly in handling color, structure, and textless chart questions, indicating its effectiveness in complex multimodal tasks.
- [485] arXiv:2404.01551 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: Multi-Agent Reinforcement Learning with Control-Theoretic Safety Guarantees for Dynamic Network BridgingComments: 7 pages, 21 equations, 3 figures, 1 algorithm, and 1 tableSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Abstract: Addressing complex cooperative tasks in safety-critical environments poses significant challenges for Multi-Agent Systems, especially under conditions of partial observability. This work introduces a hybrid approach that integrates Multi-Agent Reinforcement Learning with control-theoretic methods to ensure safe and efficient distributed strategies. Our contributions include a novel setpoint update algorithm that dynamically adjusts agents' positions to preserve safety conditions without compromising the mission's objectives. Through experimental validation, we demonstrate significant advantages over conventional MARL strategies, achieving comparable task performance with zero safety violations. Our findings indicate that integrating safe control with learning approaches not only enhances safety compliance but also achieves good performance in mission objectives.
- [486] arXiv:2404.01557 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: Distributed Autonomous Swarm Formation for Dynamic Network BridgingRaffaele Galliera , Thies Möhlenhof , Alessandro Amato , Daniel Duran , Kristen Brent Venable , Niranjan SuriComments: 6 pages, 3 figures, 1 table, 1 algorithmSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: Effective operation and seamless cooperation of robotic systems are a fundamental component of next-generation technologies and applications. In contexts such as disaster response, swarm operations require coordinated behavior and mobility control to be handled in a distributed manner, with the quality of the agents' actions heavily relying on the communication between them and the underlying network. In this paper, we formulate the problem of dynamic network bridging in a novel Decentralized Partially Observable Markov Decision Process (Dec-POMDP), where a swarm of agents cooperates to form a link between two distant moving targets. Furthermore, we propose a Multi-Agent Reinforcement Learning (MARL) approach for the problem based on Graph Convolutional Reinforcement Learning (DGN) which naturally applies to the networked, distributed nature of the task. The proposed method is evaluated in a simulated environment and compared to a centralized heuristic baseline showing promising results. Moreover, a further step in the direction of sim-to-real transfer is presented, by additionally evaluating the proposed approach in a near Live Virtual Constructive (LVC) UAV framework.
- [487] arXiv:2404.01558 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Automated User Story Generation with Test Case Specification Using Large Language ModelComments: 10 pages including 2 pages of AppendixSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Modern Software Engineering era is moving fast with the assistance of artificial intelligence (AI), especially Large Language Models (LLM). Researchers have already started automating many parts of the software development workflow. Requirements Engineering (RE) is a crucial phase that begins the software development cycle through multiple discussions on a proposed scope of work documented in different forms. RE phase ends with a list of user-stories for each unit task identified through discussions and usually these are created and tracked on a project management tool such as Jira, AzurDev etc. In this research we developed a tool "GeneUS" using GPT-4.0 to automatically create user stories from requirements document which is the outcome of the RE phase. The output is provided in JSON format leaving the possibilities open for downstream integration to the popular project management tools. Analyzing requirements documents takes significant effort and multiple meetings with stakeholders. We believe, automating this process will certainly reduce additional load off the software engineers, and increase the productivity since they will be able to utilize their time on other prioritized tasks.
- [488] arXiv:2404.01569 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Evaluating Large Language Models Using Contrast Sets: An Experimental ApproachSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In the domain of Natural Language Inference (NLI), especially in tasks involving the classification of multiple input texts, the Cross-Entropy Loss metric is widely employed as a standard for error measurement. However, this metric falls short in effectively evaluating a model's capacity to understand language entailments. In this study, we introduce an innovative technique for generating a contrast set for the Stanford Natural Language Inference (SNLI) dataset. Our strategy involves the automated substitution of verbs, adverbs, and adjectives with their synonyms to preserve the original meaning of sentences. This method aims to assess whether a model's performance is based on genuine language comprehension or simply on pattern recognition. We conducted our analysis using the ELECTRA-small model. The model achieved an accuracy of 89.9% on the conventional SNLI dataset but showed a reduced accuracy of 72.5% on our contrast set, indicating a substantial 17% decline. This outcome led us to conduct a detailed examination of the model's learning behaviors. Following this, we improved the model's resilience by fine-tuning it with a contrast-enhanced training dataset specifically designed for SNLI, which increased its accuracy to 85.5% on the contrast sets. Our findings highlight the importance of incorporating diverse linguistic expressions into datasets for NLI tasks. We hope that our research will encourage the creation of more inclusive datasets, thereby contributing to the development of NLI models that are both more sophisticated and effective.
- [489] arXiv:2404.01582 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: BERT-Enhanced Retrieval Tool for Homework Plagiarism Detection SystemSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Text plagiarism detection task is a common natural language processing task that aims to detect whether a given text contains plagiarism or copying from other texts. In existing research, detection of high level plagiarism is still a challenge due to the lack of high quality datasets. In this paper, we propose a plagiarized text data generation method based on GPT-3.5, which produces 32,927 pairs of text plagiarism detection datasets covering a wide range of plagiarism methods, bridging the gap in this part of research. Meanwhile, we propose a plagiarism identification method based on Faiss with BERT with high efficiency and high accuracy. Our experiments show that the performance of this model outperforms other models in several metrics, including 98.86\%, 98.90%, 98.86%, and 0.9888 for Accuracy, Precision, Recall, and F1 Score, respectively. At the end, we also provide a user-friendly demo platform that allows users to upload a text library and intuitively participate in the plagiarism analysis.
- [490] arXiv:2404.01588 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Hallucination Diversity-Aware Active Learning for Text SummarizationComments: Accepted to NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs) have shown propensity to generate hallucinated outputs, i.e., texts that are factually incorrect or unsupported. Existing methods for alleviating hallucinations typically require costly human annotations to identify and correct hallucinations in LLM outputs. Moreover, most of these methods focus on a specific type of hallucination, e.g., entity or token errors, which limits their effectiveness in addressing various types of hallucinations exhibited in LLM outputs. To our best knowledge, in this paper we propose the first active learning framework to alleviate LLM hallucinations, reducing costly human annotations of hallucination needed. By measuring fine-grained hallucinations from errors in semantic frame, discourse and content verifiability in text summarization, we propose HAllucination Diversity-Aware Sampling (HADAS) to select diverse hallucinations for annotations in active learning for LLM finetuning. Extensive experiments on three datasets and different backbone models demonstrate advantages of our method in effectively and efficiently mitigating LLM hallucinations.
- [491] arXiv:2404.01589 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Classifying Cancer Stage with Open-Source Clinical Large Language ModelsComments: accepted in the IEEE International Conference on Healthcare Informatics (IEEE ICHI 2024)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Cancer stage classification is important for making treatment and care management plans for oncology patients. Information on staging is often included in unstructured form in clinical, pathology, radiology and other free-text reports in the electronic health record system, requiring extensive work to parse and obtain. To facilitate the extraction of this information, previous NLP approaches rely on labeled training datasets, which are labor-intensive to prepare. In this study, we demonstrate that without any labeled training data, open-source clinical large language models (LLMs) can extract pathologic tumor-node-metastasis (pTNM) staging information from real-world pathology reports. Our experiments compare LLMs and a BERT-based model fine-tuned using the labeled data. Our findings suggest that while LLMs still exhibit subpar performance in Tumor (T) classification, with the appropriate adoption of prompting strategies, they can achieve comparable performance on Metastasis (M) classification and improved performance on Node (N) classification.
- [492] arXiv:2404.01596 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: PhysORD: A Neuro-Symbolic Approach for Physics-infused Motion Prediction in Off-road DrivingSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Motion prediction is critical for autonomous off-road driving, however, it presents significantly more challenges than on-road driving because of the complex interaction between the vehicle and the terrain. Traditional physics-based approaches encounter difficulties in accurately modeling dynamic systems and external disturbance. In contrast, data-driven neural networks require extensive datasets and struggle with explicitly capturing the fundamental physical laws, which can easily lead to poor generalization. By merging the advantages of both methods, neuro-symbolic approaches present a promising direction. These methods embed physical laws into neural models, potentially significantly improving generalization capabilities. However, no prior works were evaluated in real-world settings for off-road driving. To bridge this gap, we present PhysORD, a neural-symbolic approach integrating the conservation law, i.e., the Euler-Lagrange equation, into data-driven neural models for motion prediction in off-road driving. Our experiments showed that PhysORD can accurately predict vehicle motion and tolerate external disturbance by modeling uncertainties. It outperforms existing methods both in accuracy and efficiency and demonstrates data-efficient learning and generalization ability in long-term prediction.
- [493] arXiv:2404.01598 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Extremum-Seeking Action Selection for Accelerating Policy OptimizationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Reinforcement learning for control over continuous spaces typically uses high-entropy stochastic policies, such as Gaussian distributions, for local exploration and estimating policy gradient to optimize performance. Many robotic control problems deal with complex unstable dynamics, where applying actions that are off the feasible control manifolds can quickly lead to undesirable divergence. In such cases, most samples taken from the ambient action space generate low-value trajectories that hardly contribute to policy improvement, resulting in slow or failed learning. We propose to improve action selection in this model-free RL setting by introducing additional adaptive control steps based on Extremum-Seeking Control (ESC). On each action sampled from stochastic policies, we apply sinusoidal perturbations and query for estimated Q-values as the response signal. Based on ESC, we then dynamically improve the sampled actions to be closer to nearby optima before applying them to the environment. Our methods can be easily added in standard policy optimization to improve learning efficiency, which we demonstrate in various control learning environments.
- [494] arXiv:2404.01602 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Helmsman of the Masses? Evaluate the Opinion Leadership of Large Language Models in the Werewolf GameComments: 32 pages, 6 figures, 21 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Large language models (LLMs) have exhibited memorable strategic behaviors in social deductive games. However, the significance of opinion leadership exhibited by LLM-based agents has been overlooked, which is crucial for practical applications in multi-agent and human-AI interaction settings. Opinion leaders are individuals who have a noticeable impact on the beliefs and behaviors of others within a social group. In this work, we employ the Werewolf game as a simulation platform to assess the opinion leadership of LLMs. The game features the role of the Sheriff, tasked with summarizing arguments and recommending decision options, and therefore serves as a credible proxy for an opinion leader. We develop a framework integrating the Sheriff role and devise two novel metrics for evaluation based on the critical characteristics of opinion leaders. The first metric measures the reliability of the opinion leader, and the second assesses the influence of the opinion leader on other players' decisions. We conduct extensive experiments to evaluate LLMs of different scales. In addition, we collect a Werewolf question-answering dataset (WWQA) to assess and enhance LLM's grasp of the game rules, and we also incorporate human participants for further analysis. The results suggest that the Werewolf game is a suitable test bed to evaluate the opinion leadership of LLMs and few LLMs possess the capacity for opinion leadership.
- [495] arXiv:2404.01620 (cross-list from cs.SD) [ pdf , ps , other ]
-
Title: Voice EHR: Introducing Multimodal Audio Data for HealthJames Anibal , Hannah Huth , Ming Li , Lindsey Hazen , Yen Minh Lam , Nguyen Thi Thu Hang , Michael Kleinman , Shelley Ost , Christopher Jackson , Laura Sprabery , Cheran Elangovan , Balaji Krishnaiah , Lee Akst , Ioan Lina , Iqbal Elyazar , Lenny Ekwati , Stefan Jansen , Richard Nduwayezu , Charisse Garcia , Jeffrey Plum , Jacqueline Brenner , Miranda Song , Emily Ricotta , David Clifton , C. Louise Thwaites , Yael Bensoussan , Bradford WoodComments: 18 pages, 2 figures, 7 tablesSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Audio and Speech Processing (eess.AS)
Abstract: Large AI models trained on audio data may have the potential to rapidly classify patients, enhancing medical decision-making and potentially improving outcomes through early detection. Existing technologies depend on limited datasets using expensive recording equipment in high-income, English-speaking countries. This challenges deployment in resource-constrained, high-volume settings where audio data may have a profound impact. This report introduces a novel data type and a corresponding collection system that captures health data through guided questions using only a mobile/web application. This application ultimately results in an audio electronic health record (voice EHR) which may contain complex biomarkers of health from conventional voice/respiratory features, speech patterns, and language with semantic meaning - compensating for the typical limitations of unimodal clinical datasets. This report introduces a consortium of partners for global work, presents the application used for data collection, and showcases the potential of informative voice EHR to advance the scalability and diversity of audio AI.
- [496] arXiv:2404.01622 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Gen4DS: Workshop on Data Storytelling in an Era of Generative AISubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Graphics (cs.GR)
Abstract: Storytelling is an ancient and precious human ability that has been rejuvenated in the digital age. Over the last decade, there has been a notable surge in the recognition and application of data storytelling, both in academia and industry. Recently, the rapid development of generative AI has brought new opportunities and challenges to this field, sparking numerous new questions. These questions may not necessarily be quickly transformed into papers, but we believe it is necessary to promptly discuss them to help the community better clarify important issues and research agendas for the future. We thus invite you to join our workshop (Gen4DS) to discuss questions such as: How can generative AI facilitate the creation of data stories? How might generative AI alter the workflow of data storytellers? What are the pitfalls and risks of incorporating AI in storytelling? We have designed both paper presentations and interactive activities (including hands-on creation, group discussion pods, and debates on controversial issues) for the workshop. We hope that participants will learn about the latest advances and pioneering work in data storytelling, engage in critical conversations with each other, and have an enjoyable, unforgettable, and meaningful experience at the event.
- [497] arXiv:2404.01628 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Learning Equi-angular Representations for Online Continual LearningMinhyuk Seo , Hyunseo Koh , Wonje Jeung , Minjae Lee , San Kim , Hankook Lee , Sungjun Cho , Sungik Choi , Hyunwoo Kim , Jonghyun ChoiComments: CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Online continual learning suffers from an underfitted solution due to insufficient training for prompt model update (e.g., single-epoch training). To address the challenge, we propose an efficient online continual learning method using the neural collapse phenomenon. In particular, we induce neural collapse to form a simplex equiangular tight frame (ETF) structure in the representation space so that the continuously learned model with a single epoch can better fit to the streamed data by proposing preparatory data training and residual correction in the representation space. With an extensive set of empirical validations using CIFAR-10/100, TinyImageNet, ImageNet-200, and ImageNet-1K, we show that our proposed method outperforms state-of-the-art methods by a noticeable margin in various online continual learning scenarios such as disjoint and Gaussian scheduled continuous (i.e., boundary-free) data setups.
- [498] arXiv:2404.01636 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Learning to Control Camera Exposure via Reinforcement LearningComments: Accepted at CVPR 2024, *First two authors contributed equally to this work. Project page link: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Abstract: Adjusting camera exposure in arbitrary lighting conditions is the first step to ensure the functionality of computer vision applications. Poorly adjusted camera exposure often leads to critical failure and performance degradation. Traditional camera exposure control methods require multiple convergence steps and time-consuming processes, making them unsuitable for dynamic lighting conditions. In this paper, we propose a new camera exposure control framework that rapidly controls camera exposure while performing real-time processing by exploiting deep reinforcement learning. The proposed framework consists of four contributions: 1) a simplified training ground to simulate real-world's diverse and dynamic lighting changes, 2) flickering and image attribute-aware reward design, along with lightweight state design for real-time processing, 3) a static-to-dynamic lighting curriculum to gradually improve the agent's exposure-adjusting capability, and 4) domain randomization techniques to alleviate the limitation of the training ground and achieve seamless generalization in the this http URL a result, our proposed method rapidly reaches a desired exposure level within five steps with real-time processing (1 ms). Also, the acquired images are well-exposed and show superiority in various computer vision tasks, such as feature extraction and object detection.
- [499] arXiv:2404.01652 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Better Generalization in Open-Domain Question Answering by Mitigating Context MemorizationComments: Accepted to NAACL 2024 FindingsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Open-domain Question Answering (OpenQA) aims at answering factual questions with an external large-scale knowledge corpus. However, real-world knowledge is not static; it updates and evolves continually. Such a dynamic characteristic of knowledge poses a vital challenge for these models, as the trained models need to constantly adapt to the latest information to make sure that the answers remain accurate. In addition, it is still unclear how well an OpenQA model can transfer to completely new knowledge domains. In this paper, we investigate the generalization performance of a retrieval-augmented QA model in two specific scenarios: 1) adapting to updated versions of the same knowledge corpus; 2) switching to completely different knowledge domains. We observe that the generalization challenges of OpenQA models stem from the reader's over-reliance on memorizing the knowledge from the external corpus, which hinders the model from generalizing to a new knowledge corpus. We introduce Corpus-Invariant Tuning (CIT), a simple but effective training strategy, to mitigate the knowledge over-memorization by controlling the likelihood of retrieved contexts during training. Extensive experimental results on multiple OpenQA benchmarks show that CIT achieves significantly better generalizability without compromising the model's performance in its original corpus and domain.
- [500] arXiv:2404.01654 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AI WALKUP: A Computer-Vision Approach to Quantifying MDS-UPDRS in Parkinson's DiseaseComments: Technical report for AI WALKUP, an APP winning 3rd Prize of 2022 HUST GS AI Innovation and Design CompetitionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Abstract: Parkinson's Disease (PD) is the second most common neurodegenerative disorder. The existing assessment method for PD is usually the Movement Disorder Society - Unified Parkinson's Disease Rating Scale (MDS-UPDRS) to assess the severity of various types of motor symptoms and disease progression. However, manual assessment suffers from high subjectivity, lack of consistency, and high cost and low efficiency of manual communication. We want to use a computer vision based solution to capture human pose images based on a camera, reconstruct and perform motion analysis using algorithms, and extract the features of the amount of motion through feature engineering. The proposed approach can be deployed on different smartphones, and the video recording and artificial intelligence analysis can be done quickly and easily through our APP.
- [501] arXiv:2404.01657 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Release of Pre-Trained Models for the Japanese LanguageKei Sawada , Tianyu Zhao , Makoto Shing , Kentaro Mitsui , Akio Kaga , Yukiya Hono , Toshiaki Wakatsuki , Koh MitsudaComments: 9 pages, 1 figure, 5 tables, accepted for LREC-COLING 2024. Models are publicly available at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Abstract: AI democratization aims to create a world in which the average person can utilize AI techniques. To achieve this goal, numerous research institutes have attempted to make their results accessible to the public. In particular, large pre-trained models trained on large-scale data have shown unprecedented potential, and their release has had a significant impact. However, most of the released models specialize in the English language, and thus, AI democratization in non-English-speaking communities is lagging significantly. To reduce this gap in AI access, we released Generative Pre-trained Transformer (GPT), Contrastive Language and Image Pre-training (CLIP), Stable Diffusion, and Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) pre-trained in Japanese. By providing these models, users can freely interface with AI that aligns with Japanese cultural values and ensures the identity of Japanese culture, thus enhancing the democratization of AI. Additionally, experiments showed that pre-trained models specialized for Japanese can efficiently achieve high performance in Japanese tasks.
- [502] arXiv:2404.01685 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: A Methodology for Improving Accuracy of Embedded Spiking Neural Networks through Kernel Size ScalingComments: 3 pages, 3 figuresSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Spiking Neural Networks (SNNs) can offer ultra low power/ energy consumption for machine learning-based applications due to their sparse spike-based operations. Currently, most of the SNN architectures need a significantly larger model size to achieve higher accuracy, which is not suitable for resource-constrained embedded applications. Therefore, developing SNNs that can achieve high accuracy with acceptable memory footprint is highly needed. Toward this, we propose a novel methodology that improves the accuracy of SNNs through kernel size scaling. Its key steps include investigating the impact of different kernel sizes on the accuracy, devising new sets of kernel sizes, generating SNN architectures based on the selected kernel sizes, and analyzing the accuracy-memory trade-offs for SNN model selection. The experimental results show that our methodology achieves higher accuracy than state-of-the-art (93.24% accuracy for CIFAR10 and 70.84% accuracy for CIFAR100) with less than 10M parameters and up to 3.45x speed-up of searching time, thereby making it suitable for embedded applications.
- [503] arXiv:2404.01709 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Upsample Guidance: Scale Up Diffusion Models without TrainingComments: 15 pages, 15 FiguresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Diffusion models have demonstrated superior performance across various generative tasks including images, videos, and audio. However, they encounter difficulties in directly generating high-resolution samples. Previously proposed solutions to this issue involve modifying the architecture, further training, or partitioning the sampling process into multiple stages. These methods have the limitation of not being able to directly utilize pre-trained models as-is, requiring additional work. In this paper, we introduce upsample guidance, a technique that adapts pretrained diffusion model (e.g., $512^2$) to generate higher-resolution images (e.g., $1536^2$) by adding only a single term in the sampling process. Remarkably, this technique does not necessitate any additional training or relying on external models. We demonstrate that upsample guidance can be applied to various models, such as pixel-space, latent space, and video diffusion models. We also observed that the proper selection of guidance scale can improve image quality, fidelity, and prompt alignment.
- [504] arXiv:2404.01712 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Efficient Online Unlearning via Hessian-Free Recollection of Individual Data StatisticsComments: 25 pages, 8 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Machine unlearning strives to uphold the data owners' right to be forgotten by enabling models to selectively forget specific data. Recent methods suggest that one approach of data forgetting is by precomputing and storing statistics carrying second-order information to improve computational and memory efficiency. However, they rely on restrictive assumptions and the computation/storage suffer from the curse of model parameter dimensionality, making it challenging to apply to most deep neural networks. In this work, we propose a Hessian-free online unlearning method. We propose to maintain a statistical vector for each data point, computed through affine stochastic recursion approximation of the difference between retrained and learned models. Our proposed algorithm achieves near-instantaneous online unlearning as it only requires a vector addition operation. Based on the strategy that recollecting statistics for forgetting data, the proposed method significantly reduces the unlearning runtime. Experimental studies demonstrate that the proposed scheme surpasses existing results by orders of magnitude in terms of time and memory costs, while also enhancing accuracy.
- [505] arXiv:2404.01713 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Generative AI for Immersive Communication: The Next Frontier in Internet-of-Senses Through 6GSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Networking and Internet Architecture (cs.NI)
Abstract: Over the past two decades, the Internet-of-Things (IoT) has been a transformative concept, and as we approach 2030, a new paradigm known as the Internet of Senses (IoS) is emerging. Unlike conventional Virtual Reality (VR), IoS seeks to provide multi-sensory experiences, acknowledging that in our physical reality, our perception extends far beyond just sight and sound; it encompasses a range of senses. This article explores existing technologies driving immersive multi-sensory media, delving into their capabilities and potential applications. This exploration includes a comparative analysis between conventional immersive media streaming and a proposed use case that leverages semantic communication empowered by generative Artificial Intelligence (AI). The focal point of this analysis is the substantial reduction in bandwidth consumption by 99.93% in the proposed scheme. Through this comparison, we aim to underscore the practical applications of generative AI for immersive media while addressing the challenges and outlining future trajectories.
- [506] arXiv:2404.01714 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization Algorithm for Deep LearningComments: 32 pages, 13 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Abstract: Training deep neural networks is a challenging task. In order to speed up training and enhance the performance of deep neural networks, we rectify the vanilla conjugate gradient as conjugate-gradient-like and incorporate it into the generic Adam, and thus propose a new optimization algorithm named CG-like-Adam for deep learning. Specifically, both the first-order and the second-order moment estimation of generic Adam are replaced by the conjugate-gradient-like. Convergence analysis handles the cases where the exponential moving average coefficient of the first-order moment estimation is constant and the first-order moment estimation is unbiased. Numerical experiments show the superiority of the proposed algorithm based on the CIFAR10/100 dataset.
- [507] arXiv:2404.01716 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: Effective internal language model training and fusion for factorized transducer modelJinxi Guo , Niko Moritz , Yingyi Ma , Frank Seide , Chunyang Wu , Jay Mahadeokar , Ozlem Kalinli , Christian Fuegen , Mike SeltzerComments: Accepted to ICASSP 2024Subjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is mainly used for estimating the ILM score and is subsequently subtracted during inference to facilitate improved integration with external language models. Recently, various of factorized transducer models have been proposed, which explicitly embrace a standalone internal language model for non-blank token prediction. However, even with the adoption of factorized transducer models, limited improvement has been observed compared to shallow fusion. In this paper, we propose a novel ILM training and decoding strategy for factorized transducer models, which effectively combines the blank, acoustic and ILM scores. Our experiments show a 17% relative improvement over the standard decoding method when utilizing a well-trained ILM and the proposed decoding strategy on LibriSpeech datasets. Furthermore, when compared to a strong RNN-T baseline enhanced with external LM fusion, the proposed model yields a 5.5% relative improvement on general-sets and an 8.9% WER reduction for rare words. The proposed model can achieve superior performance without relying on external language models, rendering it highly efficient for production use-cases. To further improve the performance, we propose a novel and memory-efficient ILM-fusion-aware minimum word error rate (MWER) training method which improves ILM integration significantly.
- [508] arXiv:2404.01740 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Weakly-supervised Audio Separation via Bi-modal Semantic SimilarityComments: Tech report. Accepted in ICLR-2024Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Conditional sound separation in multi-source audio mixtures without having access to single source sound data during training is a long standing challenge. Existing mix-and-separate based methods suffer from significant performance drop with multi-source training mixtures due to the lack of supervision signal for single source separation cases during training. However, in the case of language-conditional audio separation, we do have access to corresponding text descriptions for each audio mixture in our training data, which can be seen as (rough) representations of the audio samples in the language modality. To this end, in this paper, we propose a generic bi-modal separation framework which can enhance the existing unsupervised frameworks to separate single-source signals in a target modality (i.e., audio) using the easily separable corresponding signals in the conditioning modality (i.e., language), without having access to single-source samples in the target modality during training. We empirically show that this is well within reach if we have access to a pretrained joint embedding model between the two modalities (i.e., CLAP). Furthermore, we propose to incorporate our framework into two fundamental scenarios to enhance separation performance. First, we show that our proposed methodology significantly improves the performance of purely unsupervised baselines by reducing the distribution shift between training and test samples. In particular, we show that our framework can achieve 71% boost in terms of Signal-to-Distortion Ratio (SDR) over the baseline, reaching 97.5% of the supervised learning performance. Second, we show that we can further improve the performance of the supervised learning itself by 17% if we augment it by our proposed weakly-supervised framework, that enables a powerful semi-supervised framework for audio separation.
- [509] arXiv:2404.01741 (cross-list from cs.DC) [ pdf , ps , html , other ]
-
Title: Intrusion Tolerance for Networked Systems through Two-Level Feedback ControlComments: Preprint to appear in the conference proceedings of the 54th IEEE/IFIP Dependable Systems and Networks Conference (DSN'24)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY)
Abstract: We formulate intrusion tolerance for a system with service replicas as a two-level optimal control problem. On the local level node controllers perform intrusion recovery, and on the global level a system controller manages the replication factor. The local and global control problems can be formulated as classical problems in operations research, namely, the machine replacement problem and the inventory replenishment problem. Based on this formulation, we design TOLERANCE, a novel control architecture for intrusion-tolerant systems. We prove that the optimal control strategies on both levels have threshold structure and design efficient algorithms for computing them. We implement and evaluate TOLERANCE in an emulation environment where we run 10 types of network intrusions. The results show that TOLERANCE can improve service availability and reduce operational cost compared with state-of-the-art intrusion-tolerant systems.
- [510] arXiv:2404.01745 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Unleash the Potential of CLIP for Video Highlight DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Multimodal and large language models (LLMs) have revolutionized the utilization of open-world knowledge, unlocking novel potentials across various tasks and applications. Among these domains, the video domain has notably benefited from their capabilities. In this paper, we present Highlight-CLIP (HL-CLIP), a method designed to excel in the video highlight detection task by leveraging the pre-trained knowledge embedded in multimodal models. By simply fine-tuning the multimodal encoder in combination with our innovative saliency pooling technique, we have achieved the state-of-the-art performance in the highlight detection task, the QVHighlight Benchmark, to the best of our knowledge.
- [511] arXiv:2404.01746 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Towards Scalable & Efficient Interaction-Aware Planning in Autonomous Vehicles using Knowledge DistillationSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Real-world driving involves intricate interactions among vehicles navigating through dense traffic scenarios. Recent research focuses on enhancing the interaction awareness of autonomous vehicles to leverage these interactions in decision-making. These interaction-aware planners rely on neural-network-based prediction models to capture inter-vehicle interactions, aiming to integrate these predictions with traditional control techniques such as Model Predictive Control. However, this integration of deep learning-based models with traditional control paradigms often results in computationally demanding optimization problems, relying on heuristic methods. This study introduces a principled and efficient method for combining deep learning with constrained optimization, employing knowledge distillation to train smaller and more efficient networks, thereby mitigating complexity. We demonstrate that these refined networks maintain the problem-solving efficacy of larger models while significantly accelerating optimization. Specifically, in the domain of interaction-aware trajectory planning for autonomous vehicles, we illustrate that training a smaller prediction network using knowledge distillation speeds up optimization without sacrificing accuracy.
- [512] arXiv:2404.01752 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Safe Interval RRT* for Scalable Multi-Robot Path Planning in Continuous SpaceSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: In this paper, we consider the problem of Multi-Robot Path Planning (MRPP) in continuous space to find conflict-free paths. The difficulty of the problem arises from two primary factors. First, the involvement of multiple robots leads to combinatorial decision-making, which escalates the search space exponentially. Second, the continuous space presents potentially infinite states and actions. For this problem, we propose a two-level approach where the low level is a sampling-based planner Safe Interval RRT* (SI-RRT*) that finds a collision-free trajectory for individual robots. The high level can use any method that can resolve inter-robot conflicts where we employ two representative methods that are Prioritized Planning (SI-CPP) and Conflict Based Search (SI-CCBS). Experimental results show that SI-RRT* can find a high-quality solution quickly with a small number of samples. SI-CPP exhibits improved scalability while SI-CCBS produces higher-quality solutions compared to the state-of-the-art planners for continuous space. Compared to the most scalable existing algorithm, SI-CPP achieves a success rate that is up to 94% higher with 100 robots while maintaining solution quality (i.e., flowtime, the sum of travel times of all robots) without significant compromise. SI-CPP also decreases the makespan up to 45%. SI-CCBS decreases the flowtime by 9% compared to the competitor, albeit exhibiting a 14% lower success rate.
- [513] arXiv:2404.01754 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Peer-aided Repairer: Empowering Large Language Models to Repair Advanced Student AssignmentsQianhui Zhao , Fang Liu , Li Zhang , Yang Liu , Zhen Yan , Zhenghao Chen , Yufei Zhou , Jing Jiang , Ge LiComments: On-going workSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Automated generation of feedback on programming assignments holds significant benefits for programming education, especially when it comes to advanced assignments. Automated Program Repair techniques, especially Large Language Model based approaches, have gained notable recognition for their potential to fix introductory assignments. However, the programs used for evaluation are relatively simple. It remains unclear how existing approaches perform in repairing programs from higher-level programming courses. To address these limitations, we curate a new advanced student assignment dataset named Defects4DS from a higher-level programming course. Subsequently, we identify the challenges related to fixing bugs in advanced assignments. Based on the analysis, we develop a framework called PaR that is powered by the LLM. PaR works in three phases: Peer Solution Selection, Multi-Source Prompt Generation, and Program Repair. Peer Solution Selection identifies the closely related peer programs based on lexical, semantic, and syntactic criteria. Then Multi-Source Prompt Generation adeptly combines multiple sources of information to create a comprehensive and informative prompt for the last Program Repair stage. The evaluation on Defects4DS and another well-investigated ITSP dataset reveals that PaR achieves a new state-of-the-art performance, demonstrating impressive improvements of 19.94% and 15.2% in repair rate compared to prior state-of-the-art LLM- and symbolic-based approaches, respectively
- [514] arXiv:2404.01768 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Auditing Large Language Models for Enhanced Text-Based Stereotype Detection and Probing-Based Bias EvaluationComments: Under reviewed as a conference paper at COLM 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in Large Language Models (LLMs) have significantly increased their presence in human-facing Artificial Intelligence (AI) applications. However, LLMs could reproduce and even exacerbate stereotypical outputs from training data. This work introduces the Multi-Grain Stereotype (MGS) dataset, encompassing 51,867 instances across gender, race, profession, religion, and stereotypical text, collected by fusing multiple previously publicly available stereotype detection datasets. We explore different machine learning approaches aimed at establishing baselines for stereotype detection, and fine-tune several language models of various architectures and model sizes, presenting in this work a series of stereotypes classifier models for English text trained on MGS. To understand whether our stereotype detectors capture relevant features (aligning with human common sense) we utilise a variety of explanainable AI tools, including SHAP, LIME, and BertViz, and analyse a series of example cases discussing the results. Finally, we develop a series of stereotype elicitation prompts and evaluate the presence of stereotypes in text generation tasks with popular LLMs, using one of our best performing previously presented stereotypes detectors. Our experiments yielded several key findings: i) Training stereotype detectors in a multi-dimension setting yields better results than training multiple single-dimension classifiers.ii) The integrated MGS Dataset enhances both the in-dataset and cross-dataset generalisation ability of stereotype detectors compared to using the datasets separately. iii) There is a reduction in stereotypes in the content generated by GPT Family LLMs with newer versions.
- [515] arXiv:2404.01775 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A noisy elephant in the room: Is your out-of-distribution detector robust to label noise?Comments: Accepted at CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The ability to detect unfamiliar or unexpected images is essential for safe deployment of computer vision systems. In the context of classification, the task of detecting images outside of a model's training domain is known as out-of-distribution (OOD) detection. While there has been a growing research interest in developing post-hoc OOD detection methods, there has been comparably little discussion around how these methods perform when the underlying classifier is not trained on a clean, carefully curated dataset. In this work, we take a closer look at 20 state-of-the-art OOD detection methods in the (more realistic) scenario where the labels used to train the underlying classifier are unreliable (e.g. crowd-sourced or web-scraped labels). Extensive experiments across different datasets, noise types & levels, architectures and checkpointing strategies provide insights into the effect of class label noise on OOD detection, and show that poor separation between incorrectly classified ID samples vs. OOD samples is an overlooked yet important limitation of existing methods. Code: this https URL
- [516] arXiv:2404.01812 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Uncertainty-aware Active Learning of NeRF-based Object Models for Robot Manipulators using Visual and Re-orientation ActionsComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Manipulating unseen objects is challenging without a 3D representation, as objects generally have occluded surfaces. This requires physical interaction with objects to build their internal representations. This paper presents an approach that enables a robot to rapidly learn the complete 3D model of a given object for manipulation in unfamiliar orientations. We use an ensemble of partially constructed NeRF models to quantify model uncertainty to determine the next action (a visual or re-orientation action) by optimizing informativeness and feasibility. Further, our approach determines when and how to grasp and re-orient an object given its partial NeRF model and re-estimates the object pose to rectify misalignments introduced during the interaction. Experiments with a simulated Franka Emika Robot Manipulator operating in a tabletop environment with benchmark objects demonstrate an improvement of (i) 14% in visual reconstruction quality (PSNR), (ii) 20% in the geometric/depth reconstruction of the object surface (F-score) and (iii) 71% in the task success rate of manipulating objects a-priori unseen orientations/stable configurations in the scene; over current methods. The project page can be found here: this https URL .
- [517] arXiv:2404.01828 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Defense without Forgetting: Continual Adversarial Defense with Anisotropic & Isotropic Pseudo ReplaySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Deep neural networks have demonstrated susceptibility to adversarial attacks. Adversarial defense techniques often focus on one-shot setting to maintain robustness against attack. However, new attacks can emerge in sequences in real-world deployment scenarios. As a result, it is crucial for a defense model to constantly adapt to new attacks, but the adaptation process can lead to catastrophic forgetting of previously defended against attacks. In this paper, we discuss for the first time the concept of continual adversarial defense under a sequence of attacks, and propose a lifelong defense baseline called Anisotropic \& Isotropic Replay (AIR), which offers three advantages: (1) Isotropic replay ensures model consistency in the neighborhood distribution of new data, indirectly aligning the output preference between old and new tasks. (2) Anisotropic replay enables the model to learn a compromise data manifold with fresh mixed semantics for further replay constraints and potential future attacks. (3) A straightforward regularizer mitigates the 'plasticity-stability' trade-off by aligning model output between new and old tasks. Experiment results demonstrate that AIR can approximate or even exceed the empirical performance upper bounds achieved by Joint Training.
- [518] arXiv:2404.01833 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak AttackSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have risen significantly in popularity and are increasingly being adopted across multiple applications. These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms. However, a recent line of attacks, known as "jailbreaks", seek to overcome this alignment. Intuitively, jailbreak attacks aim to narrow the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo. Unlike existing jailbreak methods, Crescendo is a multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's replies, progressively leading to a successful jailbreak. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LlaMA-2 70b Chat, and Anthropic Chat. Our results demonstrate the strong efficacy of Crescendo, with it achieving high attack success rates across all evaluated models and tasks. Furthermore, we introduce Crescendomation, a tool that automates the Crescendo attack, and our evaluation showcases its effectiveness against state-of-the-art models.
- [519] arXiv:2404.01849 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: EV2Gym: A Flexible V2G Simulator for EV Smart Charging Research and BenchmarkingComments: 10 pages, 9 figures, and 6 tablesSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: As electric vehicle (EV) numbers rise, concerns about the capacity of current charging and power grid infrastructure grow, necessitating the development of smart charging solutions. While many smart charging simulators have been developed in recent years, only a few support the development of Reinforcement Learning (RL) algorithms in the form of a Gym environment, and those that do usually lack depth in modeling Vehicle-to-Grid (V2G) scenarios. To address the aforementioned issues, this paper introduces the EV2Gym, a realistic simulator platform for the development and assessment of small and large-scale smart charging algorithms within a standardized platform. The proposed simulator is populated with comprehensive EV, charging station, power transformer, and EV behavior models validated using real data. EV2Gym has a highly customizable interface empowering users to choose from pre-designed case studies or craft their own customized scenarios to suit their specific requirements. Moreover, it incorporates a diverse array of RL, mathematical programming, and heuristic algorithms to speed up the development and benchmarking of new solutions. By offering a unified and standardized platform, EV2Gym aims to provide researchers and practitioners with a robust environment for advancing and assessing smart charging algorithms.
- [520] arXiv:2404.01855 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Where to Move Next: Zero-shot Generalization of LLMs for Next POI RecommendationSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Next Point-of-interest (POI) recommendation provides valuable suggestions for users to explore their surrounding environment. Existing studies rely on building recommendation models from large-scale users' check-in data, which is task-specific and needs extensive computational resources. Recently, the pretrained large language models (LLMs) have achieved significant advancements in various NLP tasks and have also been investigated for recommendation scenarios. However, the generalization abilities of LLMs still are unexplored to address the next POI recommendations, where users' geographical movement patterns should be extracted. Although there are studies that leverage LLMs for next-item recommendations, they fail to consider the geographical influence and sequential transitions. Hence, they cannot effectively solve the next POI recommendation task. To this end, we design novel prompting strategies and conduct empirical studies to assess the capability of LLMs, e.g., ChatGPT, for predicting a user's next check-in. Specifically, we consider several essential factors in human movement behaviors, including user geographical preference, spatial distance, and sequential transitions, and formulate the recommendation task as a ranking problem. Through extensive experiments on two widely used real-world datasets, we derive several key findings. Empirical evaluations demonstrate that LLMs have promising zero-shot recommendation abilities and can provide accurate and reasonable predictions. We also reveal that LLMs cannot accurately comprehend geographical context information and are sensitive to the order of presentation of candidate POIs, which shows the limitations of LLMs and necessitates further research on robust human mobility reasoning mechanisms.
- [521] arXiv:2404.01863 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Confidence-aware Reward Optimization for Fine-tuning Text-to-Image ModelsKyuyoung Kim , Jongheon Jeong , Minyong An , Mohammad Ghavamzadeh , Krishnamurthy Dvijotham , Jinwoo Shin , Kimin LeeComments: ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent. However, excessive optimization with such reward models, which serve as mere proxy objectives, can compromise the performance of fine-tuned models, a phenomenon known as reward overoptimization. To investigate this issue in depth, we introduce the Text-Image Alignment Assessment (TIA2) benchmark, which comprises a diverse collection of text prompts, images, and human annotations. Our evaluation of several state-of-the-art reward models on this benchmark reveals their frequent misalignment with human assessment. We empirically demonstrate that overoptimization occurs notably when a poorly aligned reward model is used as the fine-tuning objective. To address this, we propose TextNorm, a simple method that enhances alignment based on a measure of reward model confidence estimated across a set of semantically contrastive text prompts. We demonstrate that incorporating the confidence-calibrated rewards in fine-tuning effectively reduces overoptimization, resulting in twice as many wins in human evaluation for text-image alignment compared against the baseline reward models.
- [522] arXiv:2404.01869 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A SurveyComments: 26 pages, 2 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have recently shown impressive performance on tasks involving reasoning, leading to a lively debate on whether these models possess reasoning capabilities similar to humans. However, despite these successes, the depth of LLMs' reasoning abilities remains uncertain. This uncertainty partly stems from the predominant focus on task performance, measured through shallow accuracy metrics, rather than a thorough investigation of the models' reasoning behavior. This paper seeks to address this gap by providing a comprehensive review of studies that go beyond task accuracy, offering deeper insights into the models' reasoning processes. Furthermore, we survey prevalent methodologies to evaluate the reasoning behavior of LLMs, emphasizing current trends and efforts towards more nuanced reasoning analyses. Our review suggests that LLMs tend to rely on surface-level patterns and correlations in their training data, rather than on genuine reasoning abilities. Additionally, we identify the need for further research that delineates the key differences between human and LLM-based reasoning. Through this survey, we aim to shed light on the complex reasoning processes within LLMs.
- [523] arXiv:2404.01878 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Real, fake and synthetic faces -- does the coin have three sides?Shahzeb Naeem , Ramzi Al-Sharawi , Muhammad Riyyan Khan , Usman Tariq , Abhinav Dhall , Hasan Al-NashashSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: With the ever-growing power of generative artificial intelligence, deepfake and artificially generated (synthetic) media have continued to spread online, which creates various ethical and moral concerns regarding their usage. To tackle this, we thus present a novel exploration of the trends and patterns observed in real, deepfake and synthetic facial images. The proposed analysis is done in two parts: firstly, we incorporate eight deep learning models and analyze their performances in distinguishing between the three classes of images. Next, we look to further delve into the similarities and differences between these three sets of images by investigating their image properties both in the context of the entire image as well as in the context of specific regions within the image. ANOVA test was also performed and provided further clarity amongst the patterns associated between the images of the three classes. From our findings, we observe that the investigated deeplearning models found it easier to detect synthetic facial images, with the ViT Patch-16 model performing best on this task with a class-averaged sensitivity, specificity, precision, and accuracy of 97.37%, 98.69%, 97.48%, and 98.25%, respectively. This observation was supported by further analysis of various image properties. We saw noticeable differences across the three category of images. This analysis can help us build better algorithms for facial image generation, and also shows that synthetic, deepfake and real face images are indeed three different classes.
- [524] arXiv:2404.01889 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image EnhancementSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this paper we propose a novel modification of Contrastive Language-Image Pre-Training (CLIP) guidance for the task of unsupervised backlit image enhancement. Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space. Learned prompts then guide an image enhancement network. Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance. First, we show that instead of tuning prompts in the space of text embeddings, it is possible to directly tune their embeddings in the latent space without any loss in quality. This accelerates training and potentially enables the use of additional encoders that do not have a text encoder. Second, we propose a novel approach that does not require any prompt tuning. Instead, based on CLIP embeddings of backlit and well-lit images from training data, we compute the residual vector in the embedding space as a simple difference between the mean embeddings of the well-lit and backlit images. This vector then guides the enhancement network during training, pushing a backlit image towards the space of well-lit images. This approach further dramatically reduces training time, stabilizes training and produces high quality enhanced images without artifacts, both in supervised and unsupervised training regimes. Additionally, we show that residual vectors can be interpreted, revealing biases in training data, and thereby enabling potential bias correction.
- [525] arXiv:2404.01897 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Continuous Spiking Graph Neural NetworksSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Continuous graph neural networks (CGNNs) have garnered significant attention due to their ability to generalize existing discrete graph neural networks (GNNs) by introducing continuous dynamics. They typically draw inspiration from diffusion-based methods to introduce a novel propagation scheme, which is analyzed using ordinary differential equations (ODE). However, the implementation of CGNNs requires significant computational power, making them challenging to deploy on battery-powered devices. Inspired by recent spiking neural networks (SNNs), which emulate a biological inference process and provide an energy-efficient neural architecture, we incorporate the SNNs with CGNNs in a unified framework, named Continuous Spiking Graph Neural Networks (COS-GNN). We employ SNNs for graph node representation at each time step, which are further integrated into the ODE process along with time. To enhance information preservation and mitigate information loss in SNNs, we introduce the high-order structure of COS-GNN, which utilizes the second-order ODE for spiking representation and continuous propagation. Moreover, we provide the theoretical proof that COS-GNN effectively mitigates the issues of exploding and vanishing gradients, enabling us to capture long-range dependencies between nodes. Experimental results on graph-based learning tasks demonstrate the effectiveness of the proposed COS-GNN over competitive baselines.
- [526] arXiv:2404.01914 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SCANNER: Knowledge-Enhanced Approach for Robust Multi-modal Named Entity Recognition of Unseen EntitiesComments: 13 pages, 7 figures, NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recent advances in named entity recognition (NER) have pushed the boundary of the task to incorporate visual signals, leading to many variants, including multi-modal NER (MNER) or grounded MNER (GMNER). A key challenge to these tasks is that the model should be able to generalize to the entities unseen during the training, and should be able to handle the training samples with noisy annotations. To address this obstacle, we propose SCANNER (Span CANdidate detection and recognition for NER), a model capable of effectively handling all three NER variants. SCANNER is a two-stage structure; we extract entity candidates in the first stage and use it as a query to get knowledge, effectively pulling knowledge from various sources. We can boost our performance by utilizing this entity-centric extracted knowledge to address unseen entities. Furthermore, to tackle the challenges arising from noisy annotations in NER datasets, we introduce a novel self-distillation method, enhancing the robustness and accuracy of our model in processing training data with inherent uncertainties. Our approach demonstrates competitive performance on the NER benchmark and surpasses existing methods on both MNER and GMNER benchmarks. Further analysis shows that the proposed distillation and knowledge utilization methods improve the performance of our model on various benchmarks.
- [527] arXiv:2404.01923 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SGSH: Stimulate Large Language Models with Skeleton Heuristics for Knowledge Base Question GenerationComments: Accepted by NAACL 2024 FindingsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Knowledge base question generation (KBQG) aims to generate natural language questions from a set of triplet facts extracted from KB. Existing methods have significantly boosted the performance of KBQG via pre-trained language models (PLMs) thanks to the richly endowed semantic knowledge. With the advance of pre-training techniques, large language models (LLMs) (e.g., GPT-3.5) undoubtedly possess much more semantic knowledge. Therefore, how to effectively organize and exploit the abundant knowledge for KBQG becomes the focus of our study. In this work, we propose SGSH--a simple and effective framework to Stimulate GPT-3.5 with Skeleton Heuristics to enhance KBQG. The framework incorporates "skeleton heuristics", which provides more fine-grained guidance associated with each input to stimulate LLMs to generate optimal questions, encompassing essential elements like the question phrase and the auxiliary verb.More specifically, we devise an automatic data construction strategy leveraging ChatGPT to construct a skeleton training dataset, based on which we employ a soft prompting approach to train a BART model dedicated to generating the skeleton associated with each input. Subsequently, skeleton heuristics are encoded into the prompt to incentivize GPT-3.5 to generate desired questions. Extensive experiments demonstrate that SGSH derives the new state-of-the-art performance on the KBQG tasks.
- [528] arXiv:2404.01925 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Improving Bird's Eye View Semantic Segmentation by Task DecompositionTianhao Zhao , Yongcan Chen , Yu Wu , Tianyang Liu , Bo Du , Peilun Xiao , Shi Qiu , Hongda Yang , Guozhen Li , Yi Yang , Yutian LinComments: Accepted by CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Semantic segmentation in bird's eye view (BEV) plays a crucial role in autonomous driving. Previous methods usually follow an end-to-end pipeline, directly predicting the BEV segmentation map from monocular RGB inputs. However, the challenge arises when the RGB inputs and BEV targets from distinct perspectives, making the direct point-to-point predicting hard to optimize. In this paper, we decompose the original BEV segmentation task into two stages, namely BEV map reconstruction and RGB-BEV feature alignment. In the first stage, we train a BEV autoencoder to reconstruct the BEV segmentation maps given corrupted noisy latent representation, which urges the decoder to learn fundamental knowledge of typical BEV patterns. The second stage involves mapping RGB input images into the BEV latent space of the first stage, directly optimizing the correlations between the two views at the feature level. Our approach simplifies the complexity of combining perception and generation into distinct steps, equipping the model to handle intricate and challenging scenes effectively. Besides, we propose to transform the BEV segmentation map from the Cartesian to the polar coordinate system to establish the column-wise correspondence between RGB images and BEV maps. Moreover, our method requires neither multi-scale features nor camera intrinsic parameters for depth estimation and saves computational overhead. Extensive experiments on nuScenes and Argoverse show the effectiveness and efficiency of our method. Code is available at this https URL .
- [529] arXiv:2404.01954 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: HyperCLOVA X Technical ReportKang Min Yoo , Jaegeun Han , Sookyo In , Heewon Jeon , Jisu Jeong , Jaewook Kang , Hyunwook Kim , Kyung-Min Kim , Munhyong Kim , Sungju Kim , Donghyun Kwak , Hanock Kwak , Se Jung Kwon , Bado Lee , Dongsoo Lee , Gichang Lee , Jooho Lee , Baeseong Park , Seongjin Shin , Joonsang Yu , Seolki Baek , Sumin Byeon , Eungsup Cho , Dooseok Choe , Jeesung Han , Youngkyun Jin , Hyein Jun , Jaeseung Jung , Chanwoong Kim , Jinhong Kim , Jinuk Kim , Dokyeong Lee , Dongwook Park , Jeong Min Sohn , Sujung Han , Jiae Heo , Sungju Hong , Mina Jeon , Hyunhoon Jung , Jungeun Jung , Wangkyo Jung , Chungjoon Kim , Hyeri Kim , Jonghyun Kim , Min Young Kim , Soeun Lee , Joonhee Park , Jieun Shin , Sojin Yang , Jungsoon Yoon , Hwaran Lee , Sanghwan Bae , Jeehwan Cha , Karl Gylleus , Donghoon Ham , Mihak Hong , Youngki Hong , Yunki Hong , Dahyun Jang , Hyojun Jeon , Yujin Jeon , Yeji Jeong , Myunggeun Ji , Yeguk Jin , Chansong Jo , Shinyoung Joo , Seunghwan Jung , Adrian Jungmyung Kim , Byoung Hoon Kim , Hyomin Kim , Jungwhan Kim , Minkyoung Kim , Minseung Kim , Sungdong Kim , Yonghee Kim , Youngjun Kim , Youngkwan Kim , Donghyeon Ko , Dughyun Lee , Ha Young Lee , Jaehong Lee , Jieun Lee , Jonghyun Lee , Jongjin Lee , Min Young Lee , Yehbin Lee , Taehong Min , Yuri Min , Kiyoon Moon , Hyangnam Oh , Jaesun Park , Kyuyon Park , Younghun Park , Hanbae Seo , Seunghyun Seo , Mihyun Sim , Gyubin Son , Matt Yeo , Kyung Hoon Yeom , Wonjoon YooComments: 44 pages; updated authors list and fixed author namesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.
- [530] arXiv:2404.01965 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Leveraging AutoML for Sustainable Deep Learning: A Multi-Objective HPO Approach on Deep Shift Neural NetworksSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Deep Learning (DL) has advanced various fields by extracting complex patterns from large datasets. However, the computational demands of DL models pose environmental and resource challenges. Deep shift neural networks (DSNNs) offer a solution by leveraging shift operations to reduce computational complexity at inference. Following the insights from standard DNNs, we are interested in leveraging the full potential of DSNNs by means of AutoML techniques. We study the impact of hyperparameter optimization (HPO) to maximize DSNN performance while minimizing resource consumption. Since this combines multi-objective (MO) optimization with accuracy and energy consumption as potentially complementary objectives, we propose to combine state-of-the-art multi-fidelity (MF) HPO with multi-objective optimization. Experimental results demonstrate the effectiveness of our approach, resulting in models with over 80\% in accuracy and low computational cost. Overall, our method accelerates efficient model development while enabling sustainable AI applications.
- [531] arXiv:2404.01976 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Joint-Task Regularization for Partially Labeled Multi-Task LearningComments: Accepted paper to CVPR 2024 (main conference)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Multi-task learning has become increasingly popular in the machine learning field, but its practicality is hindered by the need for large, labeled datasets. Most multi-task learning methods depend on fully labeled datasets wherein each input example is accompanied by ground-truth labels for all target tasks. Unfortunately, curating such datasets can be prohibitively expensive and impractical, especially for dense prediction tasks which require per-pixel labels for each image. With this in mind, we propose Joint-Task Regularization (JTR), an intuitive technique which leverages cross-task relations to simultaneously regularize all tasks in a single joint-task latent space to improve learning when data is not fully labeled for all tasks. JTR stands out from existing approaches in that it regularizes all tasks jointly rather than separately in pairs -- therefore, it achieves linear complexity relative to the number of tasks while previous methods scale quadratically. To demonstrate the validity of our approach, we extensively benchmark our method across a wide variety of partially labeled scenarios based on NYU-v2, Cityscapes, and Taskonomy.
- [532] arXiv:2404.01986 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Predicting the Intention to Interact with a Service Robot:the Role of Gaze CuesSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: For a service robot, it is crucial to perceive as early as possible that an approaching person intends to interact: in this case, it can proactively enact friendly behaviors that lead to an improved user experience. We solve this perception task with a sequence-to-sequence classifier of a potential user intention to interact, which can be trained in a self-supervised way. Our main contribution is a study of the benefit of features representing the person's gaze in this context. Extensive experiments on a novel dataset show that the inclusion of gaze cues significantly improves the classifier performance (AUROC increases from 84.5% to 91.2%); the distance at which an accurate classification can be achieved improves from 2.4 m to 3.2 m. We also quantify the system's ability to adapt to new environments without external supervision. Qualitative experiments show practical applications with a waiter robot.
- [533] arXiv:2404.02018 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Large Language Models for Orchestrating Bimanual RobotsComments: The project website can be found at this http URLSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Although there has been rapid progress in endowing robots with the ability to solve complex manipulation tasks, generating control policies for bimanual robots to solve tasks involving two hands is still challenging because of the difficulties in effective temporal and spatial coordination. With emergent abilities in terms of step-by-step reasoning and in-context learning, Large Language Models (LLMs) have taken control of a variety of robotic tasks. However, the nature of language communication via a single sequence of discrete symbols makes LLM-based coordination in continuous space a particular challenge for bimanual tasks. To tackle this challenge for the first time by an LLM, we present LAnguage-model-based Bimanual ORchestration (LABOR), an agent utilizing an LLM to analyze task configurations and devise coordination control policies for addressing long-horizon bimanual tasks. In the simulated environment, the LABOR agent is evaluated through several everyday tasks on the NICOL humanoid robot. Reported success rates indicate that overall coordination efficiency is close to optimal performance, while the analysis of failure causes, classified into spatial and temporal coordination and skill selection, shows that these vary over tasks. The project website can be found at this http URL
- [534] arXiv:2404.02037 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MultiParaDetox: Extending Text Detoxification with Parallel Data to New LanguagesComments: Accepted to NAACL2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Text detoxification is a textual style transfer (TST) task where a text is paraphrased from a toxic surface form, e.g. featuring rude words, to the neutral register. Recently, text detoxification methods found their applications in various task such as detoxification of Large Language Models (LLMs) (Leong et al., 2023; He et al., 2024; Tang et al., 2023) and toxic speech combating in social networks (Deng et al., 2023; Mun et al., 2023; Agarwal et al., 2023). All these applications are extremely important to ensure safe communication in modern digital worlds. However, the previous approaches for parallel text detoxification corpora collection -- ParaDetox (Logacheva et al., 2022) and APPADIA (Atwell et al., 2022) -- were explored only in monolingual setup. In this work, we aim to extend ParaDetox pipeline to multiple languages presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language. Then, we experiment with different text detoxification models -- from unsupervised baselines to LLMs and fine-tuned models on the presented parallel corpora -- showing the great benefit of parallel corpus presence to obtain state-of-the-art text detoxification models for any language.
- [535] arXiv:2404.02043 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Ukrainian Texts Classification: Exploration of Cross-lingual Knowledge Transfer ApproachesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Despite the extensive amount of labeled datasets in the NLP text classification field, the persistent imbalance in data availability across various languages remains evident. Ukrainian, in particular, stands as a language that still can benefit from the continued refinement of cross-lingual methodologies. Due to our knowledge, there is a tremendous lack of Ukrainian corpora for typical text classification tasks. In this work, we leverage the state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods avoiding manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test the approaches on three text classification tasks -- toxicity classification, formality classification, and natural language inference -- providing the "recipe" for the optimal setups.
- [536] arXiv:2404.02047 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Universal representations for financial transactional data: embracing local, global, and external contextsAlexandra Bazarova , Maria Kovaleva , Ilya Kuleshov , Evgenia Romanenkova , Alexander Stepikin , Alexandr Yugay , Dzhambulat Mollaev , Ivan Kireev , Andrey Savchenko , Alexey ZaytsevSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Effective processing of financial transactions is essential for banking data analysis. However, in this domain, most methods focus on specialized solutions to stand-alone problems instead of constructing universal representations suitable for many problems. We present a representation learning framework that addresses diverse business challenges. We also suggest novel generative models that account for data specifics, and a way to integrate external information into a client's representation, leveraging insights from other customers' actions. Finally, we offer a benchmark, describing representation quality globally, concerning the entire transaction history; locally, reflecting the client's current state; and dynamically, capturing representation evolution over time. Our generative approach demonstrates superior performance in local tasks, with an increase in ROC-AUC of up to 14\% for the next MCC prediction task and up to 46\% for downstream tasks from existing contrastive baselines. Incorporating external information improves the scores by an additional 20\%.
- [537] arXiv:2404.02060 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Long-context LLMs Struggle with Long In-context LearningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have made significant strides in handling long sequences exceeding 32K tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their abilities in more nuanced, real-world scenarios. This study introduces a specialized benchmark (LongICLBench) focusing on long in-context learning within the realm of extreme-label classification. We meticulously selected six datasets with a label range spanning 28 to 174 classes covering different input (few-shot demonstration) lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input to recognize the massive label spaces to make correct predictions. We evaluate 13 long-context LLMs on our benchmarks. We find that the long-context LLMs perform relatively well on less challenging tasks with shorter demonstration lengths by effectively utilizing the long context window. However, on the most challenging task Discovery with 174 labels, all the LLMs struggle to understand the task definition, thus reaching a performance close to zero. This suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences. Further analysis revealed a tendency among models to favor predictions for labels presented toward the end of the sequence. Their ability to reason over multiple pieces in the long sequence is yet to be improved. Our study reveals that long context understanding and reasoning is still a challenging task for the existing LLMs. We believe LongICLBench could serve as a more realistic evaluation for the future long-context LLMs.
- [538] arXiv:2404.02062 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Digital Forgetting in Large Language Models: A Survey of Unlearning MethodsAlberto Blanco-Justicia , Najeeb Jebreel , Benet Manzanares , David Sánchez , Josep Domingo-Ferrer , Guillem Collell , Kuan Eeik TanComments: 70 pagesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The objective of digital forgetting is, given a model with undesirable knowledge or behavior, obtain a new model where the detected issues are no longer present. The motivations for forgetting include privacy protection, copyright protection, elimination of biases and discrimination, and prevention of harmful content generation. Effective digital forgetting has to be effective (meaning how well the new model has forgotten the undesired knowledge/behavior), retain the performance of the original model on the desirable tasks, and be scalable (in particular forgetting has to be more efficient than retraining from scratch on just the tasks/data to be retained). This survey focuses on forgetting in large language models (LLMs). We first provide background on LLMs, including their components, the types of LLMs, and their usual training pipeline. Second, we describe the motivations, types, and desired properties of digital forgetting. Third, we introduce the approaches to digital forgetting in LLMs, among which unlearning methodologies stand out as the state of the art. Fourth, we provide a detailed taxonomy of machine unlearning methods for LLMs, and we survey and compare current approaches. Fifth, we detail datasets, models and metrics used for the evaluation of forgetting, retaining and runtime. Sixth, we discuss challenges in the area. Finally, we provide some concluding remarks.
- [539] arXiv:2404.02063 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: SPMamba: State-space model is all you need in speech separationComments: Technical Report. Work in progress. Code is available at this https URLSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: In speech separation, both CNN- and Transformer-based models have demonstrated robust separation capabilities, garnering significant attention within the research community. However, CNN-based methods have limited modelling capability for long-sequence audio, leading to suboptimal separation performance. Conversely, Transformer-based methods are limited in practical applications due to their high computational complexity. Notably, within computer vision, Mamba-based methods have been celebrated for their formidable performance and reduced computational requirements. In this paper, we propose a network architecture for speech separation using a state-space model, namely SPMamba. We adopt the TF-GridNet model as the foundational framework and substitute its Transformer component with a bidirectional Mamba module, aiming to capture a broader range of contextual information. Our experimental results reveal an important role in the performance aspects of Mamba-based models. SPMamba demonstrates superior performance with a significant advantage over existing separation models in a dataset built on Librispeech. Notably, SPMamba achieves a substantial improvement in separation quality, with a 2.42 dB enhancement in SI-SNRi compared to the TF-GridNet. The source code for SPMamba is publicly accessible at this https URL .
- [540] arXiv:2404.02067 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Red-Teaming Segment Anything ModelKrzysztof Jankowski , Bartlomiej Sobieski , Mateusz Kwiatkowski , Jakub Szulc , Michal Janik , Hubert Baniecki , Przemyslaw BiecekComments: CVPR 2024 - The 4th Workshop of Adversarial Machine Learning on Computer Vision: Robustness of Foundation ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Foundation models have emerged as pivotal tools, tackling many complex tasks through pre-training on vast datasets and subsequent fine-tuning for specific applications. The Segment Anything Model is one of the first and most well-known foundation models for computer vision segmentation tasks. This work presents a multi-faceted red-teaming analysis that tests the Segment Anything Model against challenging tasks: (1) We analyze the impact of style transfer on segmentation masks, demonstrating that applying adverse weather conditions and raindrops to dashboard images of city roads significantly distorts generated masks. (2) We focus on assessing whether the model can be used for attacks on privacy, such as recognizing celebrities' faces, and show that the model possesses some undesired knowledge in this task. (3) Finally, we check how robust the model is to adversarial attacks on segmentation masks under text prompts. We not only show the effectiveness of popular white-box attacks and resistance to black-box attacks but also introduce a novel approach - Focused Iterative Gradient Attack (FIGA) that combines white-box approaches to construct an efficient attack resulting in a smaller number of modified pixels. All of our testing methods and analyses indicate a need for enhanced safety measures in foundation models for image segmentation.
- [541] arXiv:2404.02090 (cross-list from cs.NE) [ pdf , ps , other ]
-
Title: Already Moderate Population Sizes Provably Yield Strong Robustness to NoiseComments: Full version of the same-titled paper accepted at GECCO 2024Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Experience shows that typical evolutionary algorithms can cope well with stochastic disturbances such as noisy function evaluations.
In this first mathematical runtime analysis of the $(1+\lambda)$ and $(1,\lambda)$ evolutionary algorithms in the presence of prior bit-wise noise, we show that both algorithms can tolerate constant noise probabilities without increasing the asymptotic runtime on the OneMax benchmark. For this, a population size $\lambda$ suffices that is at least logarithmic in the problem size $n$. The only previous result in this direction regarded the less realistic one-bit noise model, required a population size super-linear in the problem size, and proved a runtime guarantee roughly cubic in the noiseless runtime for the OneMax benchmark. Our significantly stronger results are based on the novel proof argument that the noiseless offspring can be seen as a biased uniform crossover between the parent and the noisy offspring. We are optimistic that the technical lemmas resulting from this insight will find applications also in future mathematical runtime analyses of evolutionary algorithms. - [542] arXiv:2404.02127 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: FLawN-T5: An Empirical Examination of Effective Instruction-Tuning Data Mixtures for Legal ReasoningJoel Niklaus , Lucia Zheng , Arya D. McCarthy , Christopher Hahn , Brian M. Rosen , Peter Henderson , Daniel E. Ho , Garrett Honke , Percy Liang , Christopher ManningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Instruction tuning is an important step in making language models useful for direct user interaction. However, many legal tasks remain out of reach for most open LLMs and there do not yet exist any large scale instruction datasets for the domain. This critically limits research in this application area. In this work, we curate LawInstruct, a large legal instruction dataset, covering 17 jurisdictions, 24 languages and a total of 12M examples. We present evidence that domain-specific pretraining and instruction tuning improve performance on LegalBench, including improving Flan-T5 XL by 8 points or 16\% over the baseline. However, the effect does not generalize across all tasks, training regimes, model sizes, and other factors. LawInstruct is a resource for accelerating the development of models with stronger information processing and decision making capabilities in the legal domain.
- [543] arXiv:2404.02151 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive AttacksSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract: We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize the target logprob (e.g., of the token "Sure"), potentially with multiple restarts. In this way, we achieve nearly 100\% attack success rate -- according to GPT-4 as a judge -- on GPT-3.5/4, Llama-2-Chat-7B/13B/70B, Gemma-7B, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with 100\% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). We provide the code, prompts, and logs of the attacks at this https URL .
- [544] arXiv:2404.02157 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Segment Any 3D Object with LanguageComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we investigate Open-Vocabulary 3D Instance Segmentation (OV-3DIS) with free-form language instructions. Earlier works that rely on only annotated base categories for training suffer from limited generalization to unseen novel categories. Recent works mitigate poor generalizability to novel categories by generating class-agnostic masks or projecting generalized masks from 2D to 3D, but disregard semantic or geometry information, leading to sub-optimal performance. Instead, generating generalizable but semantic-related masks directly from 3D point clouds would result in superior outcomes. In this paper, we introduce Segment any 3D Object with LanguagE (SOLE), which is a semantic and geometric-aware visual-language learning framework with strong generalizability by generating semantic-related masks directly from 3D point clouds. Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both backbone and decoder. In addition, to align the 3D segmentation model with various language instructions and enhance the mask quality, we introduce three types of multimodal associations as supervision. Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks, and the results are even close to the fully-supervised counterpart despite the absence of class annotations in the training. Furthermore, extensive qualitative results demonstrate the versatility of our SOLE to language instructions.
- [545] arXiv:2404.02174 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: Bounds of Block Rewards in Honest PinFi SystemsSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Abstract: PinFi is a class of novel protocols for decentralized pricing of dissipative assets, whose value naturally declines over time. Central to the protocol's functionality and its market efficiency is the role of liquidity providers (LPs). This study addresses critical stability and sustainability challenges within the protocol, namely: the propensity of LPs to prefer selling in external markets over participation in the protocol; a similar inclination towards selling within the PinFi system rather than contributing as LPs; and a scenario where LPs are disinclined to sell within the protocol. Employing a game-theoretic approach, we explore PinFi's mechanisms and its broader ramifications. Our findings reveal that, under a variety of common conditions and with an assumption of participant integrity, PinFi is capable of fostering a dynamic equilibrium among LPs, sellers, and buyers. This balance is maintained through a carefully calibrated range of block rewards for LPs, ensuring the protocol's long-term stability and utility.
- [546] arXiv:2404.02176 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Versatile Navigation under Partial Observability via Value-guided Diffusion PolicyComments: 13 pages, 7 figures, CVPR 2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Route planning for navigation under partial observability plays a crucial role in modern robotics and autonomous driving. Existing route planning approaches can be categorized into two main classes: traditional autoregressive and diffusion-based methods. The former often fails due to its myopic nature, while the latter either assumes full observability or struggles to adapt to unfamiliar scenarios, due to strong couplings with behavior cloning from experts. To address these deficiencies, we propose a versatile diffusion-based approach for both 2D and 3D route planning under partial observability. Specifically, our value-guided diffusion policy first generates plans to predict actions across various timesteps, providing ample foresight to the planning. It then employs a differentiable planner with state estimations to derive a value function, directing the agent's exploration and goal-seeking behaviors without seeking experts while explicitly addressing partial observability. During inference, our policy is further enhanced by a best-plan-selection strategy, substantially boosting the planning success rate. Moreover, we propose projecting point clouds, derived from RGB-D inputs, onto 2D grid-based bird-eye-view maps via semantic segmentation, generalizing to 3D environments. This simple yet effective adaption enables zero-shot transfer from 2D-trained policy to 3D, cutting across the laborious training for 3D policy, and thus certifying our versatility. Experimental results demonstrate our superior performance, particularly in navigating situations beyond expert demonstrations, surpassing state-of-the-art autoregressive and diffusion-based baselines for both 2D and 3D scenarios.
- [547] arXiv:2404.02179 (cross-list from cs.IT) [ pdf , ps , html , other ]
-
Title: Distributed and Rate-Adaptive Feature CompressionSubjects: Information Theory (cs.IT) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: We study the problem of distributed and rate-adaptive feature compression for linear regression. A set of distributed sensors collect disjoint features of regressor data. A fusion center is assumed to contain a pretrained linear regression model, trained on a dataset of the entire uncompressed data. At inference time, the sensors compress their observations and send them to the fusion center through communication-constrained channels, whose rates can change with time. Our goal is to design a feature compression {scheme} that can adapt to the varying communication constraints, while maximizing the inference performance at the fusion center. We first obtain the form of optimal quantizers assuming knowledge of underlying regressor data distribution. Under a practically reasonable approximation, we then propose a distributed compression scheme which works by quantizing a one-dimensional projection of the sensor data. We also propose a simple adaptive scheme for handling changes in communication constraints. We demonstrate the effectiveness of the distributed adaptive compression scheme through simulated experiments.
- [548] arXiv:2404.02180 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Remote sensing framework for geological mapping via stacked autoencoders and clusteringSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Supervised learning methods for geological mapping via remote sensing face limitations due to the scarcity of accurately labelled training data. In contrast, unsupervised learning methods, such as dimensionality reduction and clustering have the ability to uncover patterns and structures in remote sensing data without relying on predefined labels. Dimensionality reduction methods have the potential to play a crucial role in improving the accuracy of geological maps. Although conventional dimensionality reduction methods may struggle with nonlinear data, unsupervised deep learning models such as autoencoders have the ability to model nonlinear relationship in data. Stacked autoencoders feature multiple interconnected layers to capture hierarchical data representations that can be useful for remote sensing data. In this study, we present an unsupervised machine learning framework for processing remote sensing data by utilizing stacked autoencoders for dimensionality reduction and k-means clustering for mapping geological units. We use the Landsat-8, ASTER, and Sentinel-2 datasets of the Mutawintji region in Western New South Wales, Australia to evaluate the framework for geological mapping. We also provide a comparison of stacked autoencoders with principal component analysis and canonical autoencoders. Our results reveal that the framework produces accurate and interpretable geological maps, efficiently discriminating rock units. We find that the stacked autoencoders provide better accuracy when compared to the counterparts. We also find that the generated maps align with prior geological knowledge of the study area while providing novel insights into geological structures.
- [549] arXiv:2404.02181 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Leveraging Machine Learning for Early Autism Detection via INDT-ASD Indian DatabaseSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Machine learning (ML) has advanced quickly, particularly throughout the area of health care. The diagnosis of neurodevelopment problems using ML is a very important area of healthcare. Autism spectrum disorder (ASD) is one of the developmental disorders that is growing the fastest globally. The clinical screening tests used to identify autistic symptoms are expensive and time-consuming. But now that ML has been advanced, it's feasible to identify autism early on. Previously, many different techniques have been used in investigations. Still, none of them have produced the anticipated outcomes when it comes to the capacity to predict autistic features utilizing a clinically validated Indian ASD database. Therefore, this study aimed to develop a simple, quick, and inexpensive technique for identifying ASD by using ML. Various machine learning classifiers, including Adaboost (AB), Gradient Boost (GB), Decision Tree (DT), Logistic Regression (LR), Random Forest (RF), Gaussian Naive Bayes (GNB), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), K-Nearest Neighbors (KNN), and Support Vector Machine (SVM), were used to develop the autism prediction model. The proposed method was tested with records from the AIIMS Modified INDT-ASD (AMI) database, which were collected through an application developed by AIIMS in Delhi, India. Feature engineering has been applied to make the proposed solution easier than already available solutions. Using the proposed model, we succeeded in predicting ASD using a minimized set of 20 questions rather than the 28 questions presented in AMI with promising accuracy. In a comparative evaluation, SVM emerged as the superior model among others, with 100 $\pm$ 0.05\% accuracy, higher recall by 5.34\%, and improved accuracy by 2.22\%-6.67\% over RF. We have also introduced a web-based solution supporting both Hindi and English.
- [550] arXiv:2404.02183 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and OptimizationSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: Recent advancements in automatic code generation using large language model (LLM) agent have brought us closer to the future of automated software development. However, existing single-agent approaches face limitations in generating and improving large-scale, complex codebases due to constraints in context length. To tackle this challenge, we propose Self-Organized multi-Agent framework (SoA), a novel multi-agent framework that enables the scalable and efficient generation and optimization of large-scale code. In SoA, self-organized agents operate independently to generate and modify code components while seamlessly collaborating to construct the overall codebase. A key feature of our framework is the automatic multiplication of agents based on problem complexity, allowing for dynamic scalability. This enables the overall code volume to be increased indefinitely according to the number of agents, while the amount of code managed by each agent remains constant. We evaluate SoA on the HumanEval benchmark and demonstrate that, compared to a single-agent system, each agent in SoA handles significantly less code, yet the overall generated code is substantially greater. Moreover, SoA surpasses the powerful single-agent baseline by 5% in terms of Pass@1 accuracy.
- [551] arXiv:2404.02187 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A Generative Deep Learning Approach for Crash Severity Modeling with Imbalanced DataSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Crash data is often greatly imbalanced, with the majority of crashes being non-fatal crashes, and only a small number being fatal crashes due to their rarity. Such data imbalance issue poses a challenge for crash severity modeling since it struggles to fit and interpret fatal crash outcomes with very limited samples. Usually, such data imbalance issues are addressed by data resampling methods, such as under-sampling and over-sampling techniques. However, most traditional and deep learning-based data resampling methods, such as synthetic minority oversampling technique (SMOTE) and generative Adversarial Networks (GAN) are designed dedicated to processing continuous variables. Though some resampling methods have improved to handle both continuous and discrete variables, they may have difficulties in dealing with the collapse issue associated with sparse discrete risk factors. Moreover, there is a lack of comprehensive studies that compare the performance of various resampling methods in crash severity modeling. To address the aforementioned issues, the current study proposes a crash data generation method based on the Conditional Tabular GAN. After data balancing, a crash severity model is employed to estimate the performance of classification and interpretation. A comparative study is conducted to assess classification accuracy and distribution consistency of the proposed generation method using a 4-year imbalanced crash dataset collected in Washington State, U.S. Additionally, Monte Carlo simulation is employed to estimate the performance of parameter and probability estimation in both two- and three-class imbalance scenarios. The results indicate that using synthetic data generated by CTGAN-RU for crash severity modeling outperforms using original data or synthetic data generated by other resampling methods.
- [552] arXiv:2404.02189 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Insights from the Use of Previously Unseen Neural Architecture Search DatasetsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The boundless possibility of neural networks which can be used to solve a problem -- each with different performance -- leads to a situation where a Deep Learning expert is required to identify the best neural network. This goes against the hope of removing the need for experts. Neural Architecture Search (NAS) offers a solution to this by automatically identifying the best architecture. However, to date, NAS work has focused on a small set of datasets which we argue are not representative of real-world problems. We introduce eight new datasets created for a series of NAS Challenges: AddNIST, Language, MultNIST, CIFARTile, Gutenberg, Isabella, GeoClassing, and Chesseract. These datasets and challenges are developed to direct attention to issues in NAS development and to encourage authors to consider how their models will perform on datasets unknown to them at development time. We present experimentation using standard Deep Learning methods as well as the best results from challenge participants.
- [553] arXiv:2404.02205 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: A Holistic Indicator of Polarization to Measure Online SexismSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI)
Abstract: The online trend of the manosphere and feminist discourse on social networks requires a holistic measure of the level of sexism in an online community. This indicator is important for policymakers and moderators of online communities (e.g., subreddits) and computational social scientists, either to revise moderation strategies based on the degree of sexism or to match and compare the temporal sexism across different platforms and communities with real-time events and infer social scientific insights.
In this paper, we build a model that can provide a comparable holistic indicator of toxicity targeted toward male and female identity and male and female individuals. Despite previous supervised NLP methods that require annotation of toxic comments at the target level (e.g. annotating comments that are specifically toxic toward women) to detect targeted toxic comments, our indicator uses supervised NLP to detect the presence of toxicity and unsupervised word embedding association test to detect the target automatically.
We apply our model to gender discourse communities (e.g., r/TheRedPill, r/MGTOW, r/FemaleDatingStrategy) to detect the level of toxicity toward genders (i.e., sexism). Our results show that our framework accurately and consistently (93% correlation) measures the level of sexism in a community. We finally discuss how our framework can be generalized in the future to measure qualities other than toxicity (e.g. sentiment, humor) toward general-purpose targets and turn into an indicator of different sorts of polarizations. - [554] arXiv:2404.02213 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Exploring How Multiple Levels of GPT-Generated Programming Hints Support or Disappoint NovicesComments: Accepted CHI 2024 LBW - 10 pagesSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Recent studies have integrated large language models (LLMs) into diverse educational contexts, including providing adaptive programming hints, a type of feedback focuses on helping students move forward during problem-solving. However, most existing LLM-based hint systems are limited to one single hint type. To investigate whether and how different levels of hints can support students' problem-solving and learning, we conducted a think-aloud study with 12 novices using the LLM Hint Factory, a system providing four levels of hints from general natural language guidance to concrete code assistance, varying in format and granularity. We discovered that high-level natural language hints alone can be helpless or even misleading, especially when addressing next-step or syntax-related help requests. Adding lower-level hints, like code examples with in-line comments, can better support students. The findings open up future work on customizing help responses from content, format, and granularity levels to accurately identify and meet students' learning needs.
- [555] arXiv:2404.02225 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CHOSEN: Contrastive Hypothesis Selection for Multi-View Depth RefinementDi Qiu , Yinda Zhang , Thabo Beeler , Vladimir Tankovich , Christian Häne , Sean Fanello , Christoph Rhemann , Sergio Orts EscolanoSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We propose CHOSEN, a simple yet flexible, robust and effective multi-view depth refinement framework. It can be employed in any existing multi-view stereo pipeline, with straightforward generalization capability for different multi-view capture systems such as camera relative positioning and lenses. Given an initial depth estimation, CHOSEN iteratively re-samples and selects the best hypotheses, and automatically adapts to different metric or intrinsic scales determined by the capture system. The key to our approach is the application of contrastive learning in an appropriate solution space and a carefully designed hypothesis feature, based on which positive and negative hypotheses can be effectively distinguished. Integrated in a simple baseline multi-view stereo pipeline, CHOSEN delivers impressive quality in terms of depth and normal accuracy compared to many current deep learning based multi-view stereo pipelines.
- [556] arXiv:2404.02227 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning DenoisingComments: In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024 (CVPR)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: Trajectory prediction is fundamental in computer vision and autonomous driving, particularly for understanding pedestrian behavior and enabling proactive decision-making. Existing approaches in this field often assume precise and complete observational data, neglecting the challenges associated with out-of-view objects and the noise inherent in sensor data due to limited camera range, physical obstructions, and the absence of ground truth for denoised sensor data. Such oversights are critical safety concerns, as they can result in missing essential, non-visible objects. To bridge this gap, we present a novel method for out-of-sight trajectory prediction that leverages a vision-positioning technique. Our approach denoises noisy sensor observations in an unsupervised manner and precisely maps sensor-based trajectories of out-of-sight objects into visual trajectories. This method has demonstrated state-of-the-art performance in out-of-sight noisy sensor trajectory denoising and prediction on the Vi-Fi and JRDB datasets. By enhancing trajectory prediction accuracy and addressing the challenges of out-of-sight objects, our work significantly contributes to improving the safety and reliability of autonomous driving in complex environments. Our work represents the first initiative towards Out-Of-Sight Trajectory prediction (OOSTraj), setting a new benchmark for future research. The code is available at \url{ this https URL }.
- [557] arXiv:2404.02235 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Is Exploration All You Need? Effective Exploration Characteristics for Transfer in Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In deep reinforcement learning (RL) research, there has been a concerted effort to design more efficient and productive exploration methods while solving sparse-reward problems. These exploration methods often share common principles (e.g., improving diversity) and implementation details (e.g., intrinsic reward). Prior work found that non-stationary Markov decision processes (MDPs) require exploration to efficiently adapt to changes in the environment with online transfer learning. However, the relationship between specific exploration characteristics and effective transfer learning in deep RL has not been characterized. In this work, we seek to understand the relationships between salient exploration characteristics and improved performance and efficiency in transfer learning. We test eleven popular exploration algorithms on a variety of transfer types -- or ``novelties'' -- to identify the characteristics that positively affect online transfer learning. Our analysis shows that some characteristics correlate with improved performance and efficiency across a wide range of transfer tasks, while others only improve transfer performance with respect to specific environment changes. From our analysis, make recommendations about which exploration algorithm characteristics are best suited to specific transfer situations.
- [558] arXiv:2404.02249 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: RAT: Retrieval-Augmented Transformer for Click-Through Rate PredictionComments: Accepted to The ACM Web Conference 2024 (WWW'24, short paper). Data and code are availableSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract: Predicting click-through rates (CTR) is a fundamental task for Web applications, where a key issue is to devise effective models for feature interactions. Current methodologies predominantly concentrate on modeling feature interactions within an individual sample, while overlooking the potential cross-sample relationships that can serve as a reference context to enhance the prediction. To make up for such deficiency, this paper develops a Retrieval-Augmented Transformer (RAT), aiming to acquire fine-grained feature interactions within and across samples. By retrieving similar samples, we construct augmented input for each target sample. We then build Transformer layers with cascaded attention to capture both intra- and cross-sample feature interactions, facilitating comprehensive reasoning for improved CTR prediction while retaining efficiency. Extensive experiments on real-world datasets substantiate the effectiveness of RAT and suggest its advantage in long-tail scenarios. The code has been open-sourced at \url{ this https URL }.
- [559] arXiv:2404.02254 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: On Stronger Computational Separations Between Multimodal and Unimodal Machine LearningSubjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In multimodal machine learning, multiple modalities of data (e.g., text and images) are combined to facilitate the learning of a better machine learning model, which remains applicable to a corresponding unimodal task (e.g., text generation). Recently, multimodal machine learning has enjoyed huge empirical success (e.g. GPT-4). Motivated to develop theoretical justification for this empirical success, Lu (NeurIPS '23, ALT '24) introduces a theory of multimodal learning, and considers possible separations between theoretical models of multimodal and unimodal learning. In particular, Lu (ALT '24) shows a computational separation, which is relevant to worst-case instances of the learning task.
In this paper, we give a stronger average-case computational separation, where for "typical" instances of the learning task, unimodal learning is computationally hard, but multimodal learning is easy. We then question how "organic" the average-case separation is. Would it be encountered in practice? To this end, we prove that under natural conditions, any given computational separation between average-case unimodal and multimodal learning tasks implies a corresponding cryptographic key agreement protocol. We suggest to interpret this as evidence that very strong computational advantages of multimodal learning may arise infrequently in practice, since they exist only for the "pathological" case of inherently cryptographic distributions. However, this does not apply to possible (super-polynomial) statistical advantages. - [560] arXiv:2404.02255 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: $\texttt{LM}^\texttt{2}$: A Simple Society of Language Models Solves Complex ReasoningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Despite demonstrating emergent reasoning abilities, Large Language Models (LLMS) often lose track of complex, multi-step reasoning. Existing studies show that providing guidance via decomposing the original question into multiple subproblems elicits more robustness in LLM reasoning -- a decomposer generates the subproblems, and a solver solves each of these subproblems. However, these techniques fail to accommodate coordination between the decomposer and the solver modules (either in a single model or different specialized ones) -- the decomposer does not keep track of the ability of the solver to follow the decomposed reasoning. In this paper, we propose LM2 to address these challenges. LM2 modularizes the decomposition, solution, and verification into three different language models. The decomposer module identifies the key concepts necessary to solve the problem and generates step-by-step subquestions according to the reasoning requirement. The solver model generates the solution to the subproblems that are then checked by the verifier module; depending upon the feedback from the verifier, the reasoning context is constructed using the subproblems and the solutions. These models are trained to coordinate using policy learning. Exhaustive experimentation suggests the superiority of LM2 over existing methods on in- and out-domain reasoning problems, outperforming the best baselines by $8.1\%$ on MATH, $7.71\%$ on JEEBench, and $9.7\%$ on MedQA problems (code available at this https URL ).
- [561] arXiv:2404.02261 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: LLMs in the Loop: Leveraging Large Language Model Annotations for Active Learning in Low-Resource LanguagesComments: 19 pages, 6 tables. The source code related to this paper is available at https://anonymous.4open.science/r/llms-in-the-loop-4E32Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: Low-resource languages face significant barriers in AI development due to limited linguistic resources and expertise for data labeling, rendering them rare and costly. The scarcity of data and the absence of preexisting tools exacerbate these challenges, especially since these languages may not be adequately represented in various NLP datasets. To address this gap, we propose leveraging the potential of LLMs in the active learning loop for data annotation. Initially, we conduct evaluations to assess inter-annotator agreement and consistency, facilitating the selection of a suitable LLM annotator. The chosen annotator is then integrated into a training loop for a classifier using an active learning paradigm, minimizing the amount of queried data required. Empirical evaluations, notably employing GPT-4-Turbo, demonstrate near-state-of-the-art performance with significantly reduced data requirements, as indicated by estimated potential cost savings of at least 42.45 times compared to human annotation. Our proposed solution shows promising potential to substantially reduce both the monetary and computational costs associated with automation in low-resource settings. By bridging the gap between low-resource languages and AI, this approach fosters broader inclusion and shows the potential to enable automation across diverse linguistic landscapes.
- [562] arXiv:2404.02263 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: OFMPNet: Deep End-to-End Model for Occupancy and Flow Prediction in Urban EnvironmentComments: Accepted in Neurocomputing journal - 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: The task of motion prediction is pivotal for autonomous driving systems, providing crucial data to choose a vehicle behavior strategy within its surroundings. Existing motion prediction techniques primarily focus on predicting the future trajectory of each agent in the scene individually, utilizing its past trajectory data. In this paper, we introduce an end-to-end neural network methodology designed to predict the future behaviors of all dynamic objects in the environment. This approach leverages the occupancy map and the scene's motion flow. We are investigatin various alternatives for constructing a deep encoder-decoder model called OFMPNet. This model uses a sequence of bird's-eye-view road images, occupancy grid, and prior motion flow as input data. The encoder of the model can incorporate transformer, attention-based, or convolutional units. The decoder considers the use of both convolutional modules and recurrent blocks. Additionally, we propose a novel time-weighted motion flow loss, whose application has shown a substantial decrease in end-point error. Our approach has achieved state-of-the-art results on the Waymo Occupancy and Flow Prediction benchmark, with a Soft IoU of 52.1% and an AUC of 76.75% on Flow-Grounded Occupancy.
- [563] arXiv:2404.02269 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Extracting Norms from Contracts Via ChatGPT: Opportunities and ChallengesComments: Accepted at COINE-AAMAS 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We investigate the effectiveness of ChatGPT in extracting norms from contracts. Norms provide a natural way to engineer multiagent systems by capturing how to govern the interactions between two or more autonomous parties. We extract norms of commitment, prohibition, authorization, and power, along with associated norm elements (the parties involved, antecedents, and consequents) from contracts. Our investigation reveals ChatGPT's effectiveness and limitations in norm extraction from contracts. ChatGPT demonstrates promising performance in norm extraction without requiring training or fine-tuning, thus obviating the need for annotated data, which is not generally available in this domain. However, we found some limitations of ChatGPT in extracting these norms that lead to incorrect norm extractions. The limitations include oversight of crucial details, hallucination, incorrect parsing of conjunctions, and empty norm elements. Enhanced norm extraction from contracts can foster the development of more transparent and trustworthy formal agent interaction specifications, thereby contributing to the improvement of multiagent systems.
- [564] arXiv:2404.02287 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: One Noise to Rule Them All: Multi-View Adversarial Attacks with Universal PerturbationComments: 6 pages, 4 figures, presented at ICAIA, Springer to publish under Algorithms for Intelligent SystemsJournal-ref: 2nd International Conference on Artificial Intelligence and Applications (ICAIA 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents a novel universal perturbation method for generating robust multi-view adversarial examples in 3D object recognition. Unlike conventional attacks limited to single views, our approach operates on multiple 2D images, offering a practical and scalable solution for enhancing model scalability and robustness. This generalizable method bridges the gap between 2D perturbations and 3D-like attack capabilities, making it suitable for real-world applications.
Existing adversarial attacks may become ineffective when images undergo transformations like changes in lighting, camera position, or natural deformations. We address this challenge by crafting a single universal noise perturbation applicable to various object views. Experiments on diverse rendered 3D objects demonstrate the effectiveness of our approach. The universal perturbation successfully identified a single adversarial noise for each given set of 3D object renders from multiple poses and viewpoints. Compared to single-view attacks, our universal attacks lower classification confidence across multiple viewing angles, especially at low noise levels. A sample implementation is made available at this https URL . - [565] arXiv:2404.02304 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Virtual Sensor for Real-Time Bearing Load Prediction Using Heterogeneous Temporal Graph Neural NetworksComments: 8 pages, 6 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Abstract: Accurate bearing load monitoring is essential for their Prognostics and Health Management (PHM), enabling damage assessment, wear prediction, and proactive maintenance. While bearing sensors are typically placed on the bearing housing, direct load monitoring requires sensors inside the bearing itself. Recently introduced sensor rollers enable direct bearing load monitoring but are constrained by their battery life. Data-driven virtual sensors can learn from sensor roller data collected during a batterys lifetime to map operating conditions to bearing loads. Although spatially distributed bearing sensors offer insights into load distribution (e.g., correlating temperature with load), traditional machine learning algorithms struggle to fully exploit these spatial-temporal dependencies. To address this gap, we introduce a graph-based virtual sensor that leverages Graph Neural Networks (GNNs) to analyze spatial-temporal dependencies among sensor signals, mapping existing measurements (temperature, vibration) to bearing loads. Since temperature and vibration signals exhibit vastly different dynamics, we propose Heterogeneous Temporal Graph Neural Networks (HTGNN), which explicitly models these signal types and their interactions for effective load prediction. Our results demonstrate that HTGNN outperforms Convolutional Neural Networks (CNNs), which struggle to capture both spatial and heterogeneous signal characteristics. These findings highlight the importance of capturing the complex spatial interactions between temperature, vibration, and load.
- [566] arXiv:2404.02305 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Collapse of Self-trained Language ModelsComments: ICLR 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In various fields of knowledge creation, including science, new ideas often build on pre-existing information. In this work, we explore this concept within the context of language models. Specifically, we explore the potential of self-training models on their own outputs, akin to how humans learn and build on their previous thoughts and actions. While this approach is intuitively appealing, our research reveals its practical limitations. We find that extended self-training of the GPT-2 model leads to a significant degradation in performance, resulting in repetitive and collapsed token output.
- [567] arXiv:2404.02314 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Is Meta-training Really Necessary for Molecular Few-Shot Learning ?Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Few-shot learning has recently attracted significant interest in drug discovery, with a recent, fast-growing literature mostly involving convoluted meta-learning strategies. We revisit the more straightforward fine-tuning approach for molecular data, and propose a regularized quadratic-probe loss based on the the Mahalanobis distance. We design a dedicated block-coordinate descent optimizer, which avoid the degenerate solutions of our loss. Interestingly, our simple fine-tuning approach achieves highly competitive performances in comparison to state-of-the-art methods, while being applicable to black-box settings and removing the need for specific episodic pre-training strategies. Furthermore, we introduce a new benchmark to assess the robustness of the competing methods to domain shifts. In this setting, our fine-tuning baseline obtains consistently better results than meta-learning methods.
- [568] arXiv:2404.02319 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Prompts As Programs: A Structure-Aware Approach to Efficient Compile-Time Prompt OptimizationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) can now handle longer and more complex inputs, which facilitate the use of more elaborate prompts. However, prompts often require some tuning to improve performance for deployment. Recent work has proposed automatic prompt optimization methods, but as prompt complexity and LLM strength increase, many prompt optimization techniques are no longer sufficient and a new approach is needed to optimize {\em meta prompt programs}. To address this, we introduce SAMMO, a framework for {\em compile-time} optimizations of metaprompt programs, which represent prompts as structured objects that allows for a rich set of transformations that can be searched over during optimization. We show that SAMMO generalizes previous methods and improves the performance of complex prompts on (1) instruction tuning, (2) RAG pipeline tuning, and (3) prompt compression, across several different LLMs.
We make all code available open-source at this https URL . - [569] arXiv:2404.02330 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Comparative Study of Domain Driven Terms Extraction Using Large Language ModelsSandeep Chataut , Tuyen Do , Bichar Dip Shrestha Gurung , Shiva Aryal , Anup Khanal , Carol Lushbough , Etienne GnimpiebaSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Keywords play a crucial role in bridging the gap between human understanding and machine processing of textual data. They are essential to data enrichment because they form the basis for detailed annotations that provide a more insightful and in-depth view of the underlying data. Keyword/domain driven term extraction is a pivotal task in natural language processing, facilitating information retrieval, document summarization, and content categorization. This review focuses on keyword extraction methods, emphasizing the use of three major Large Language Models(LLMs): Llama2-7B, GPT-3.5, and Falcon-7B. We employed a custom Python package to interface with these LLMs, simplifying keyword extraction. Our study, utilizing the Inspec and PubMed datasets, evaluates the performance of these models. The Jaccard similarity index was used for assessment, yielding scores of 0.64 (Inspec) and 0.21 (PubMed) for GPT-3.5, 0.40 and 0.17 for Llama2-7B, and 0.23 and 0.12 for Falcon-7B. This paper underlines the role of prompt engineering in LLMs for better keyword extraction and discusses the impact of hallucination in LLMs on result evaluation. It also sheds light on the challenges in using LLMs for keyword extraction, including model complexity, resource demands, and optimization techniques.
- [570] arXiv:2404.02335 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Multi-BERT: Leveraging Adapters and Prompt Tuning for Low-Resource Multi-Domain AdaptationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The rapid expansion of texts' volume and diversity presents formidable challenges in multi-domain settings. These challenges are also visible in the Persian name entity recognition (NER) settings. Traditional approaches, either employing a unified model for multiple domains or individual models for each domain, frequently pose significant limitations. Single models often struggle to capture the nuances of diverse domains, while utilizing multiple large models can lead to resource constraints, rendering the training of a model for each domain virtually impractical. Therefore, this paper introduces a novel approach composed of one core model with multiple sets of domain-specific parameters. We utilize techniques such as prompt tuning and adapters, combined with the incorporation of additional layers, to add parameters that we can train for the specific domains. This enables the model to perform comparably to individual models for each domain. Experimental results on different formal and informal datasets show that by employing these added parameters, the proposed model significantly surpasses existing practical models in performance. Remarkably, the proposed model requires only one instance for training and storage, yet achieves outstanding results across all domains, even surpassing the state-of-the-art in some. Moreover, we analyze each adaptation strategy, delineating its strengths, weaknesses, and optimal hyper-parameters for the Persian NER settings. Finally, we introduce a document-based domain detection pipeline tailored for scenarios with unknown text domains, enhancing the adaptability and practicality of this paper in real-world applications.
- [571] arXiv:2404.02353 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Semantic Augmentation in Images using LanguageSahiti Yerramilli , Jayant Sravan Tamarapalli , Tanmay Girish Kulkarni , Jonathan Francis , Eric NybergSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Deep Learning models are incredibly data-hungry and require very large labeled datasets for supervised learning. As a consequence, these models often suffer from overfitting, limiting their ability to generalize to real-world examples. Recent advancements in diffusion models have enabled the generation of photorealistic images based on textual inputs. Leveraging the substantial datasets used to train these diffusion models, we propose a technique to utilize generated images to augment existing datasets. This paper explores various strategies for effective data augmentation to improve the out-of-domain generalization capabilities of deep learning models.
- [572] arXiv:2404.02361 (cross-list from cs.MA) [ pdf , ps , other ]
-
Title: EnergAIze: Multi Agent Deep Deterministic Policy Gradient for Vehicle to Grid Energy ManagementComments: 6 pages, 6 figures, 2 tablesSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI)
Abstract: This paper investigates the increasing roles of Renewable Energy Sources (RES) and Electric Vehicles (EVs). While indicating a new era of sustainable energy, these also introduce complex challenges, including the need to balance supply and demand and smooth peak consumptions amidst rising EV adoption rates. Addressing these challenges requires innovative solutions such as Demand Response (DR), energy flexibility management, Renewable Energy Communities (RECs), and more specifically for EVs, Vehicle-to-Grid (V2G). However, existing V2G approaches often fall short in real-world adaptability, global REC optimization with other flexible assets, scalability, and user engagement. To bridge this gap, this paper introduces EnergAIze, a Multi-Agent Reinforcement Learning (MARL) energy management framework, leveraging the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. EnergAIze enables user-centric and multi-objective energy management by allowing each prosumer to select from a range of personal management objectives, thus encouraging engagement. Additionally, it architects' data protection and ownership through decentralized computing, where each prosumer can situate an energy management optimization node directly at their own dwelling. The local node not only manages local energy assets but also fosters REC wide optimization. The efficacy of EnergAIze was evaluated through case studies employing the CityLearn simulation framework. These simulations were instrumental in demonstrating EnergAIze's adeptness at implementing V2G technology within a REC and other energy assets. The results show reduction in peak loads, ramping, carbon emissions, and electricity costs at the REC level while optimizing for individual prosumers objectives.
- [573] arXiv:2404.02370 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze PatternsComments: Under reviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Recent advancements in Computer Assisted Diagnosis have shown promising performance in medical imaging tasks, particularly in chest X-ray analysis. However, the interaction between these models and radiologists has been primarily limited to input images. This work proposes a novel approach to enhance human-computer interaction in chest X-ray analysis using Vision-Language Models (VLMs) enhanced with radiologists' attention by incorporating eye gaze data alongside textual prompts. Our approach leverages heatmaps generated from eye gaze data, overlaying them onto medical images to highlight areas of intense radiologist's focus during chest X-ray evaluation. We evaluate this methodology in tasks such as visual question answering, chest X-ray report automation, error detection, and differential diagnosis. Our results demonstrate the inclusion of eye gaze information significantly enhances the accuracy of chest X-ray analysis. Also, the impact of eye gaze on fine-tuning was confirmed as it outperformed other medical VLMs in all tasks except visual question answering. This work marks the potential of leveraging both the VLM's capabilities and the radiologist's domain knowledge to improve the capabilities of AI models in medical imaging, paving a novel way for Computer Assisted Diagnosis with a human-centred AI.
- [574] arXiv:2404.02389 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: On Linearizing Structured Data in Encoder-Decoder Language Models: Insights from Text-to-SQLComments: to appear at NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Structured data, prevalent in tables, databases, and knowledge graphs, poses a significant challenge in its representation. With the advent of large language models (LLMs), there has been a shift towards linearization-based methods, which process structured data as sequential token streams, diverging from approaches that explicitly model structure, often as a graph. Crucially, there remains a gap in our understanding of how these linearization-based methods handle structured data, which is inherently non-linear. This work investigates the linear handling of structured data in encoder-decoder language models, specifically T5. Our findings reveal the model's ability to mimic human-designed processes such as schema linking and syntax prediction, indicating a deep, meaningful learning of structure beyond simple token sequencing. We also uncover insights into the model's internal mechanisms, including the ego-centric nature of structure node encodings and the potential for model compression due to modality fusion redundancy. Overall, this work sheds light on the inner workings of linearization-based methods and could potentially provide guidance for future research.
- [575] arXiv:2404.02402 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Token Trails: Navigating Contextual Depths in Conversational AI with ChatLLMMd. Kowsher , Ritesh Panditi , Nusrat Jahan Prottasha , Prakash Bhat , Anupam Kumar Bairagi , Mohammad Shamsul ArefinSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: Conversational modeling using Large Language Models (LLMs) requires a nuanced understanding of context to generate coherent and contextually relevant responses. In this paper, we present Token Trails, a novel approach that leverages token-type embeddings to navigate the intricate contextual nuances within conversations. Our framework utilizes token-type embeddings to distinguish between user utterances and bot responses, facilitating the generation of context-aware replies. Through comprehensive experimentation and evaluation, we demonstrate the effectiveness of Token Trails in improving conversational understanding and response generation, achieving state-of-the-art performance. Our results highlight the significance of contextual modeling in conversational AI and underscore the promising potential of Token Trails to advance the field, paving the way for more sophisticated and contextually aware chatbot interactions.
- [576] arXiv:2404.02406 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Exploring Backdoor Vulnerabilities of Chat ModelsComments: Code and data are available at this https URLSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Recent researches have shown that Large Language Models (LLMs) are susceptible to a security threat known as Backdoor Attack. The backdoored model will behave well in normal cases but exhibit malicious behaviours on inputs inserted with a specific backdoor trigger. Current backdoor studies on LLMs predominantly focus on instruction-tuned LLMs, while neglecting another realistic scenario where LLMs are fine-tuned on multi-turn conversational data to be chat models. Chat models are extensively adopted across various real-world scenarios, thus the security of chat models deserves increasing attention. Unfortunately, we point out that the flexible multi-turn interaction format instead increases the flexibility of trigger designs and amplifies the vulnerability of chat models to backdoor attacks. In this work, we reveal and achieve a novel backdoor attacking method on chat models by distributing multiple trigger scenarios across user inputs in different rounds, and making the backdoor be triggered only when all trigger scenarios have appeared in the historical conversations. Experimental results demonstrate that our method can achieve high attack success rates (e.g., over 90% ASR on Vicuna-7B) while successfully maintaining the normal capabilities of chat models on providing helpful responses to benign user requests. Also, the backdoor can not be easily removed by the downstream re-alignment, highlighting the importance of continued research and attention to the security concerns of chat models. Warning: This paper may contain toxic content.
- [577] arXiv:2404.02407 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Decision Transformer as a Foundation Model for Partially Observable Continuous ControlComments: Submitted to CDC 2024Subjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: Closed-loop control of nonlinear dynamical systems with partial-state observability demands expert knowledge of a diverse, less standardized set of theoretical tools. Moreover, it requires a delicate integration of controller and estimator designs to achieve the desired system behavior. To establish a general controller synthesis framework, we explore the Decision Transformer (DT) architecture. Specifically, we first frame the control task as predicting the current optimal action based on past observations, actions, and rewards, eliminating the need for a separate estimator design. Then, we leverage the pre-trained language models, i.e., the Generative Pre-trained Transformer (GPT) series, to initialize DT and subsequently train it for control tasks using low-rank adaptation (LoRA). Our comprehensive experiments across five distinct control tasks, ranging from maneuvering aerospace systems to controlling partial differential equations (PDEs), demonstrate DT's capability to capture the parameter-agnostic structures intrinsic to control tasks. DT exhibits remarkable zero-shot generalization abilities for completely new tasks and rapidly surpasses expert performance levels with a minimal amount of demonstration data. These findings highlight the potential of DT as a foundational controller for general control applications.
- [578] arXiv:2404.02418 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Auxiliary task demands mask the capabilities of smaller language modelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Developmental psychologists have argued about when cognitive capacities such as language understanding or theory of mind emerge. These debates often hinge on the concept of "task demands" -- the auxiliary challenges associated with performing a particular evaluation -- that may mask the child's underlying ability. The same issues arise when measuring the capacities of language models (LMs): performance on a task is a function of the model's underlying competence, combined with the model's ability to interpret and perform the task given its available resources. Here, we show that for analogical reasoning, reflective reasoning, word prediction, and grammaticality judgments, evaluation methods with greater task demands yield lower performance than evaluations with reduced demands. This "demand gap" is most pronounced for models with fewer parameters and less training data. Our results illustrate that LM performance should not be interpreted as a direct indication of intelligence (or lack thereof), but as a reflection of capacities seen through the lens of researchers' design choices.
- [579] arXiv:2404.02429 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: AD4RL: Autonomous Driving Benchmarks for Offline Reinforcement Learning with Value-based DatasetComments: ICRA 2024 Website at: this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Offline reinforcement learning has emerged as a promising technology by enhancing its practicality through the use of pre-collected large datasets. Despite its practical benefits, most algorithm development research in offline reinforcement learning still relies on game tasks with synthetic datasets. To address such limitations, this paper provides autonomous driving datasets and benchmarks for offline reinforcement learning research. We provide 19 datasets, including real-world human driver's datasets, and seven popular offline reinforcement learning algorithms in three realistic driving scenarios. We also provide a unified decision-making process model that can operate effectively across different scenarios, serving as a reference framework in algorithm design. Our research lays the groundwork for further collaborations in the community to explore practical aspects of existing reinforcement learning methods. Dataset and codes can be found in this https URL .
- [580] arXiv:2404.02444 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in EducationComments: NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Assessing instruction quality is a fundamental component of any improvement efforts in the education system. However, traditional manual assessments are expensive, subjective, and heavily dependent on observers' expertise and idiosyncratic factors, preventing teachers from getting timely and frequent feedback. Different from prior research that mostly focuses on low-inference instructional practices on a singular basis, this paper presents the first study that leverages Natural Language Processing (NLP) techniques to assess multiple high-inference instructional practices in two distinct educational settings: in-person K-12 classrooms and simulated performance tasks for pre-service teachers. This is also the first study that applies NLP to measure a teaching practice that is widely acknowledged to be particularly effective for students with special needs. We confront two challenges inherent in NLP-based instructional analysis, including noisy and long input data and highly skewed distributions of human ratings. Our results suggest that pretrained Language Models (PLMs) demonstrate performances comparable to the agreement level of human raters for variables that are more discrete and require lower inference, but their efficacy diminishes with more complex teaching practices. Interestingly, using only teachers' utterances as input yields strong results for student-centered variables, alleviating common concerns over the difficulty of collecting and transcribing high-quality student speech data in in-person teaching settings. Our findings highlight both the potential and the limitations of current NLP techniques in the education domain, opening avenues for further exploration.
- [581] arXiv:2404.02447 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: A Novel Approach to Breast Cancer Histopathological Image Classification Using Cross-Colour Space Feature Fusion and Quantum-Classical Stack Ensemble MethodSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Breast cancer classification stands as a pivotal pillar in ensuring timely diagnosis and effective treatment. This study with histopathological images underscores the profound significance of harnessing the synergistic capabilities of colour space ensembling and quantum-classical stacking to elevate the precision of breast cancer classification. By delving into the distinct colour spaces of RGB, HSV and CIE L*u*v, the authors initiated a comprehensive investigation guided by advanced methodologies. Employing the DenseNet121 architecture for feature extraction the authors have capitalized on the robustness of Random Forest, SVM, QSVC, and VQC classifiers. This research encompasses a unique feature fusion technique within the colour space ensemble. This approach not only deepens our comprehension of breast cancer classification but also marks a milestone in personalized medical assessment. The amalgamation of quantum and classical classifiers through stacking emerges as a potent catalyst, effectively mitigating the inherent constraints of individual classifiers, paving a robust path towards more dependable and refined breast cancer identification. Through rigorous experimentation and meticulous analysis, fusion of colour spaces like RGB with HSV and RGB with CIE L*u*v, presents an classification accuracy, nearing the value of unity. This underscores the transformative potential of our approach, where the fusion of diverse colour spaces and the synergy of quantum and classical realms converge to establish a new horizon in medical diagnostics. Thus the implications of this research extend across medical disciplines, offering promising avenues for advancing diagnostic accuracy and treatment efficacy.
- [582] arXiv:2404.02448 (cross-list from math.OC) [ pdf , ps , html , other ]
-
Title: Electric Vehicle Routing Problem for Emergency Power Supply: Towards Telecom Base Station ReliefComments: Accepted at AAMAS 2024 (extended abstract). 10 pages, 5 figures. Work in progressSubjects: Optimization and Control (math.OC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: As a telecom provider, our company has a critical mission to maintain telecom services even during power outages. To accomplish the mission, it is essential to maintain the power of the telecom base stations. Here we consider a solution where electric vehicles (EVs) directly supply power to base stations by traveling to their locations. The goal is to find EV routes that minimize both the total travel distance of all EVs and the number of downed base stations. In this paper, we formulate this routing problem as a new variant of the Electric Vehicle Routing Problem (EVRP) and propose a solver that combines a rule-based vehicle selector and a reinforcement learning (RL)-based node selector. The rule of the vehicle selector ensures the exact environmental states when the selected EV starts to move. In addition, the node selection by the RL model enables fast route generation, which is critical in emergencies. We evaluate our solver on both synthetic datasets and real datasets. The results show that our solver outperforms baselines in terms of the objective value and computation time. Moreover, we analyze the generalization and scalability of our solver, demonstrating the capability toward unseen settings and large-scale problems. Check also our project page: this https URL .
- [583] arXiv:2404.02450 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Task Agnostic Architecture for Algorithm Induction via Implicit CompositionComments: 12 pages, 2 figures, 2024 ICLR Generative Models for Decision Making WorkshopSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Different fields in applied machine learning such as computer vision, speech or natural language processing have been building domain-specialised solutions. Currently, we are witnessing an opposing trend towards developing more generalist architectures, driven by Large Language Models and multi-modal foundational models. These architectures are designed to tackle a variety of tasks, including those previously unseen and using inputs across multiple modalities. Taking this trend of generalization to the extreme suggests the possibility of a single deep network architecture capable of solving all tasks. This position paper aims to explore developing such a unified architecture and proposes a theoretical framework of how it could be constructed. Our proposal is based on the following assumptions. Firstly, tasks are solved by following a sequence of instructions, typically implemented in code for conventional computing hardware, which inherently operates sequentially. Second, recent Generative AI, especially Transformer-based models, demonstrate potential as an architecture capable of constructing algorithms for a wide range of domains. For example, GPT-4 shows exceptional capability at in-context learning of novel tasks which is hard to explain in any other way than the ability to compose novel solutions from fragments on previously learnt algorithms. Third, the observation that the main missing component in developing a truly generalised network is an efficient approach for self-consistent input of previously learnt sub-steps of an algorithm and their (implicit) composition during the network's internal forward pass. Our exploration delves into current capabilities and limitations of Transformer-based and other methods in efficient and correct algorithm composition and proposes a Transformer-like architecture as well as a discrete learning framework to overcome these limitations.
- [584] arXiv:2404.02456 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: PhonologyBench: Evaluating Phonological Skills of Large Language ModelsComments: 17 pages, 7 figures, 6 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: Phonology, the study of speech's structure and pronunciation rules, is a critical yet often overlooked component in Large Language Model (LLM) research. LLMs are widely used in various downstream applications that leverage phonology such as educational tools and poetry generation. Moreover, LLMs can potentially learn imperfect associations between orthographic and phonological forms from the training data. Thus, it is imperative to benchmark the phonological skills of LLMs. To this end, we present PhonologyBench, a novel benchmark consisting of three diagnostic tasks designed to explicitly test the phonological skills of LLMs in English: grapheme-to-phoneme conversion, syllable counting, and rhyme word generation. Despite having no access to speech data, LLMs showcased notable performance on the PhonologyBench tasks. However, we observe a significant gap of 17% and 45% on Rhyme Word Generation and Syllable counting, respectively, when compared to humans. Our findings underscore the importance of studying LLM performance on phonological tasks that inadvertently impact real-world applications. Furthermore, we encourage researchers to choose LLMs that perform well on the phonological task that is closely related to the downstream application since we find that no single model consistently outperforms the others on all the tasks.
- [585] arXiv:2404.02460 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: TSNet:A Two-stage Network for Image Dehazing with Multi-scale Fusion and Adaptive LearningComments: 12 pages, 10 figures, 7 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Image dehazing has been a popular topic of research for a long time. Previous deep learning-based image dehazing methods have failed to achieve satisfactory dehazing effects on both synthetic datasets and real-world datasets, exhibiting poor generalization. Moreover, single-stage networks often result in many regions with artifacts and color distortion in output images. To address these issues, this paper proposes a two-stage image dehazing network called TSNet, mainly consisting of the multi-scale fusion module (MSFM) and the adaptive learning module (ALM). Specifically, MSFM and ALM enhance the generalization of TSNet. The MSFM can obtain large receptive fields at multiple scales and integrate features at different frequencies to reduce the differences between inputs and learning objectives. The ALM can actively learn of regions of interest in images and restore texture details more effectively. Additionally, TSNet is designed as a two-stage network, where the first-stage network performs image dehazing, and the second-stage network is employed to improve issues such as artifacts and color distortion present in the results of the first-stage network. We also change the learning objective from ground truth images to opposite fog maps, which improves the learning efficiency of TSNet. Extensive experiments demonstrate that TSNet exhibits superior dehazing performance on both synthetic and real-world datasets compared to previous state-of-the-art methods.
- [586] arXiv:2404.02466 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Prompting for Numerical Sequences: A Case Study on Market Comment GenerationComments: Accepted to LREC-COLING2024 long paperSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Abstract: Large language models (LLMs) have been applied to a wide range of data-to-text generation tasks, including tables, graphs, and time-series numerical data-to-text settings. While research on generating prompts for structured data such as tables and graphs is gaining momentum, in-depth investigations into prompting for time-series numerical data are lacking. Therefore, this study explores various input representations, including sequences of tokens and structured formats such as HTML, LaTeX, and Python-style codes. In our experiments, we focus on the task of Market Comment Generation, which involves taking a numerical sequence of stock prices as input and generating a corresponding market comment. Contrary to our expectations, the results show that prompts resembling programming languages yield better outcomes, whereas those similar to natural languages and longer formats, such as HTML and LaTeX, are less effective. Our findings offer insights into creating effective prompts for tasks that generate text from numerical sequences.
- [587] arXiv:2404.02474 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: uTeBC-NLP at SemEval-2024 Task 9: Can LLMs be Lateral Thinkers?Comments: 12 pages, 5 figures, 6 tables, Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024) @ NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: Inspired by human cognition, Jiang et al.(2023c) create a benchmark for assessing LLMs' lateral thinking-thinking outside the box. Building upon this benchmark, we investigate how different prompting methods enhance LLMs' performance on this task to reveal their inherent power for outside-the-box thinking ability. Through participating in SemEval-2024, task 9, Sentence Puzzle sub-task, we explore prompt engineering methods: chain of thoughts (CoT) and direct prompting, enhancing with informative descriptions, and employing contextualizing prompts using a retrieval augmented generation (RAG) pipeline. Our experiments involve three LLMs including GPT-3.5, GPT-4, and Zephyr-7B-beta. We generate a dataset of thinking paths between riddles and options using GPT-4, validated by humans for quality. Findings indicate that compressed informative prompts enhance performance. Dynamic in-context learning enhances model performance significantly. Furthermore, fine-tuning Zephyr on our dataset enhances performance across other commonsense datasets, underscoring the value of innovative thinking.
- [588] arXiv:2404.02476 (cross-list from math.OC) [ pdf , ps , html , other ]
-
Title: Deep Reinforcement Learning for Traveling Purchaser ProblemsSubjects: Optimization and Control (math.OC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The traveling purchaser problem (TPP) is an important combinatorial optimization problem with broad applications. Due to the coupling between routing and purchasing, existing works on TPPs commonly address route construction and purchase planning simultaneously, which, however, leads to exact methods with high computational cost and heuristics with sophisticated design but limited performance. In sharp contrast, we propose a novel approach based on deep reinforcement learning (DRL), which addresses route construction and purchase planning separately, while evaluating and optimizing the solution from a global perspective. The key components of our approach include a bipartite graph representation for TPPs to capture the market-product relations, and a policy network that extracts information from the bipartite graph and uses it to sequentially construct the route. One significant benefit of our framework is that we can efficiently construct the route using the policy network, and once the route is determined, the associated purchasing plan can be easily derived through linear programming, while, leveraging DRL, we can train the policy network to optimize the global solution objective. Furthermore, by introducing a meta-learning strategy, the policy network can be trained stably on large-sized TPP instances, and generalize well across instances of varying sizes and distributions, even to much larger instances that are never seen during training. Experiments on various synthetic TPP instances and the TPPLIB benchmark demonstrate that our DRL-based approach can significantly outperform well-established TPP heuristics, reducing the optimality gap by 40%-90%, and also showing an advantage in runtime, especially on large-sized instances.
- [589] arXiv:2404.02477 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Enhancing Sum-Rate Performance in Constrained Multicell Networks: A Low-Information Exchange ApproachComments: 5 pages, 12 figuresSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI)
Abstract: Despite the extensive research on massive MIMO systems for 5G telecommunications and beyond, the reality is that many deployed base stations are equipped with a limited number of antennas rather than supporting massive MIMO configurations. Furthermore, while the cell-less network concept, which eliminates cell boundaries, is under investigation, practical deployments often grapple with significantly limited backhaul connection capacities between base stations. This letter explores techniques to maximize the sum-rate performance within the constraints of these more realistically equipped multicell networks. We propose an innovative approach that dramatically reduces the need for information exchange between base stations to a mere few bits, in stark contrast to conventional methods that require the exchange of hundreds of bits. Our proposed method not only addresses the limitations imposed by current network infrastructure but also showcases significantly improved performance under these constrained conditions.
- [590] arXiv:2404.02478 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FedSelect: Personalized Federated Learning with Customized Selection of Parameters for Fine-TuningComments: Published in CVPR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Standard federated learning approaches suffer when client data distributions have sufficient heterogeneity. Recent methods addressed the client data heterogeneity issue via personalized federated learning (PFL) - a class of FL algorithms aiming to personalize learned global knowledge to better suit the clients' local data distributions. Existing PFL methods usually decouple global updates in deep neural networks by performing personalization on particular layers (i.e. classifier heads) and global aggregation for the rest of the network. However, preselecting network layers for personalization may result in suboptimal storage of global knowledge. In this work, we propose FedSelect, a novel PFL algorithm inspired by the iterative subnetwork discovery procedure used for the Lottery Ticket Hypothesis. FedSelect incrementally expands subnetworks to personalize client parameters, concurrently conducting global aggregations on the remaining parameters. This approach enables the personalization of both client parameters and subnetwork structure during the training process. Finally, we show that FedSelect outperforms recent state-of-the-art PFL algorithms under challenging client data heterogeneity settings and demonstrates robustness to various real-world distributional shifts. Our code is available at this https URL .
- [591] arXiv:2404.02484 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: New methods for drug synergy prediction: a mini-reviewSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Abstract: In this mini-review, we explore the new prediction methods for drug combination synergy relying on high-throughput combinatorial screens. The fast progress of the field is witnessed in the more than thirty original machine learning methods published since 2021, a clear majority of them based on deep learning techniques. We aim to put these papers under a unifying lens by highlighting the core technologies, the data sources, the input data types and synergy scores used in the methods, as well as the prediction scenarios and evaluation protocols that the papers deal with. Our finding is that the best methods accurately solve the synergy prediction scenarios involving known drugs or cell lines while the scenarios involving new drugs or cell lines still fall short of an accurate prediction level.
- [592] arXiv:2404.02491 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Measuring Social Norms of Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We present a new challenge to examine whether large language models understand social norms. In contrast to existing datasets, our dataset requires a fundamental understanding of social norms to solve. Our dataset features the largest set of social norm skills, consisting of 402 skills and 12,383 questions covering a wide set of social norms ranging from opinions and arguments to culture and laws. We design our dataset according to the K-12 curriculum. This enables the direct comparison of the social understanding of large language models to humans, more specifically, elementary students. While prior work generates nearly random accuracy on our benchmark, recent large language models such as GPT3.5-Turbo and LLaMA2-Chat are able to improve the performance significantly, only slightly below human performance. We then propose a multi-agent framework based on large language models to improve the models' ability to understand social norms. This method further improves large language models to be on par with humans. Given the increasing adoption of large language models in real-world applications, our finding is particularly important and presents a unique direction for future improvements. The proposed method and dataset are available in this https URL .
- [593] arXiv:2404.02505 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Dynamic Demonstration Retrieval and Cognitive Understanding for Emotional Support ConversationComments: Accpeted by SIGIR 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Emotional Support Conversation (ESC) systems are pivotal in providing empathetic interactions, aiding users through negative emotional states by understanding and addressing their unique experiences. In this paper, we tackle two key challenges in ESC: enhancing contextually relevant and empathetic response generation through dynamic demonstration retrieval, and advancing cognitive understanding to grasp implicit mental states comprehensively. We introduce Dynamic Demonstration Retrieval and Cognitive-Aspect Situation Understanding (\ourwork), a novel approach that synergizes these elements to improve the quality of support provided in ESCs. By leveraging in-context learning and persona information, we introduce an innovative retrieval mechanism that selects informative and personalized demonstration pairs. We also propose a cognitive understanding module that utilizes four cognitive relationships from the ATOMIC knowledge source to deepen situational awareness of help-seekers' mental states. Our supportive decoder integrates information from diverse knowledge sources, underpinning response generation that is both empathetic and cognitively aware. The effectiveness of \ourwork is demonstrated through extensive automatic and human evaluations, revealing substantial improvements over numerous state-of-the-art models, with up to 13.79\% enhancement in overall performance of ten metrics. Our codes are available for public access to facilitate further research and development.
- [594] arXiv:2404.02508 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: VIAssist: Adapting Multi-modal Large Language Models for Users with Visual ImpairmentsComments: Accepted to IEEE International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things (FMSys 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Individuals with visual impairments, encompassing both partial and total difficulties in visual perception, are referred to as visually impaired (VI) people. An estimated 2.2 billion individuals worldwide are affected by visual impairments. Recent advancements in multi-modal large language models (MLLMs) have showcased their extraordinary capabilities across various domains. It is desirable to help VI individuals with MLLMs' great capabilities of visual understanding and reasoning. However, it is challenging for VI people to use MLLMs due to the difficulties in capturing the desirable images to fulfill their daily requests. For example, the target object is not fully or partially placed in the image. This paper explores how to leverage MLLMs for VI individuals to provide visual-question answers. VIAssist can identify undesired images and provide detailed actions. Finally, VIAssist can provide reliable answers to users' queries based on the images. Our results show that VIAssist provides +0.21 and +0.31 higher BERTScore and ROUGE scores than the baseline, respectively.
- [595] arXiv:2404.02510 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: An Interpretable Client Decision Tree Aggregation process for Federated LearningComments: Submitted to Information Science JournalSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Trustworthy Artificial Intelligence solutions are essential in today's data-driven applications, prioritizing principles such as robustness, safety, transparency, explainability, and privacy among others. This has led to the emergence of Federated Learning as a solution for privacy and distributed machine learning. While decision trees, as self-explanatory models, are ideal for collaborative model training across multiple devices in resource-constrained environments such as federated learning environments for injecting interpretability in these models. Decision tree structure makes the aggregation in a federated learning environment not trivial. They require techniques that can merge their decision paths without introducing bias or overfitting while keeping the aggregated decision trees robust and generalizable. In this paper, we propose an Interpretable Client Decision Tree Aggregation process for Federated Learning scenarios that keeps the interpretability and the precision of the base decision trees used for the aggregation. This model is based on aggregating multiple decision paths of the decision trees and can be used on different decision tree types, such as ID3 and CART. We carry out the experiments within four datasets, and the analysis shows that the tree built with the model improves the local models, and outperforms the state-of-the-art.
- [596] arXiv:2404.02515 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Tightly-Coupled LiDAR-IMU-Wheel Odometry with Online Calibration of a Kinematic Model for Skid-Steering RobotsTaku Okawara , Kenji Koide , Shuji Oishi , Masashi Yokozuka , Atsuhiko Banno , Kentaro Uno , Kazuya YoshidaSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Tunnels and long corridors are challenging environments for mobile robots because a LiDAR point cloud should degenerate in these environments. To tackle point cloud degeneration, this study presents a tightly-coupled LiDAR-IMU-wheel odometry algorithm with an online calibration for skid-steering robots. We propose a full linear wheel odometry factor, which not only serves as a motion constraint but also performs the online calibration of kinematic models for skid-steering robots. Despite the dynamically changing kinematic model (e.g., wheel radii changes caused by tire pressures) and terrain conditions, our method can address the model error via online calibration. Moreover, our method enables an accurate localization in cases of degenerated environments, such as long and straight corridors, by calibration while the LiDAR-IMU fusion sufficiently operates. Furthermore, we estimate the uncertainty (i.e., covariance matrix) of the wheel odometry online for creating a reasonable constraint. The proposed method is validated through three experiments. The first indoor experiment shows that the proposed method is robust in severe degeneracy cases (long corridors) and changes in the wheel radii. The second outdoor experiment demonstrates that our method accurately estimates the sensor trajectory despite being in rough outdoor terrain owing to online uncertainty estimation of wheel odometry. The third experiment shows the proposed online calibration enables robust odometry estimation in changing terrains.
- [597] arXiv:2404.02523 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Text-driven Affordance Learning from Egocentric VisionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Visual affordance learning is a key component for robots to understand how to interact with objects. Conventional approaches in this field rely on pre-defined objects and actions, falling short of capturing diverse interactions in realworld scenarios. The key idea of our approach is employing textual instruction, targeting various affordances for a wide range of objects. This approach covers both hand-object and tool-object interactions. We introduce text-driven affordance learning, aiming to learn contact points and manipulation trajectories from an egocentric view following textual instruction. In our task, contact points are represented as heatmaps, and the manipulation trajectory as sequences of coordinates that incorporate both linear and rotational movements for various manipulations. However, when we gather data for this task, manual annotations of these diverse interactions are costly. To this end, we propose a pseudo dataset creation pipeline and build a large pseudo-training dataset: TextAFF80K, consisting of over 80K instances of the contact points, trajectories, images, and text tuples. We extend existing referring expression comprehension models for our task, and experimental results show that our approach robustly handles multiple affordances, serving as a new standard for affordance learning in real-world scenarios.
- [598] arXiv:2404.02530 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Severity Controlled Text-to-Image Generative Model Bias ManipulationComments: This research was supported by National Intelligence and Security Discovery Research Grants (project# NS220100007), funded by the Department of Defence AustraliaSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Text-to-image (T2I) generative models are gaining wide popularity, especially in public domains. However, their intrinsic bias and potential malicious manipulations remain under-explored. Charting the susceptibility of T2I models to such manipulation, we first expose the new possibility of a dynamic and computationally efficient exploitation of model bias by targeting the embedded language models. By leveraging mathematical foundations of vector algebra, our technique enables a scalable and convenient control over the severity of output manipulation through model bias. As a by-product, this control also allows a form of precise prompt engineering to generate images which are generally implausible with regular text prompts. We also demonstrate a constructive application of our manipulation for balancing the frequency of generated classes - as in model debiasing. Our technique does not require training and is also framed as a backdoor attack with severity control using semantically-null text triggers in the prompts. With extensive analysis, we present interesting qualitative and quantitative results to expose potential manipulation possibilities for T2I models.
Key-words: Text-to-Image Models, Generative Models, Backdoor Attacks, Prompt Engineering, Bias - [599] arXiv:2404.02534 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language ModelSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In recent years, the development of pre-trained language models (PLMs) has gained momentum, showcasing their capacity to transcend linguistic barriers and facilitate knowledge transfer across diverse languages. However, this progress has predominantly bypassed the inclusion of very-low resource languages, creating a notable void in the multilingual landscape. This paper addresses this gap by introducing four tailored PLMs specifically finetuned for Angolan languages, employing a Multilingual Adaptive Fine-tuning (MAFT) approach. In this paper, we survey the role of informed embedding initialization and synthetic data in enhancing the performance of MAFT models in downstream tasks. We improve baseline over SOTA AfroXLMR-base (developed through MAFT) and OFA (an effective embedding initialization) by 12.3 and 3.8 points respectively.
- [600] arXiv:2404.02543 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Unbiased Learning to Rank Meets Reality: Lessons from Baidu's Large-Scale Search DatasetSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Unbiased learning-to-rank (ULTR) is a well-established framework for learning from user clicks, which are often biased by the ranker collecting the data. While theoretically justified and extensively tested in simulation, ULTR techniques lack empirical validation, especially on modern search engines. The Baidu-ULTR dataset released for the WSDM Cup 2023, collected from Baidu's search engine, offers a rare opportunity to assess the real-world performance of prominent ULTR techniques. Despite multiple submissions during the WSDM Cup 2023 and the subsequent NTCIR ULTRE-2 task, it remains unclear whether the observed improvements stem from applying ULTR or other learning techniques.
In this work, we revisit and extend the available experiments on the Baidu-ULTR dataset. We find that standard unbiased learning-to-rank techniques robustly improve click predictions but struggle to consistently improve ranking performance, especially considering the stark differences obtained by choice of ranking loss and query-document features. Our experiments reveal that gains in click prediction do not necessarily translate to enhanced ranking performance on expert relevance annotations, implying that conclusions strongly depend on how success is measured in this benchmark. - [601] arXiv:2404.02545 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Grid-Mapping Pseudo-Count Constraint for Offline Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Offline reinforcement learning learns from a static dataset without interacting with the environment, which ensures security and thus owns a good prospect of application. However, directly applying naive reinforcement learning methods usually fails in an offline environment due to function approximation errors caused by out-of-distribution(OOD) actions. To solve this problem, existing algorithms mainly penalize the Q-value of OOD actions, the quality of whose constraints also matter. Imprecise constraints may lead to suboptimal solutions, while precise constraints require significant computational costs. In this paper, we propose a novel count-based method for continuous domains, called Grid-Mapping Pseudo-Count method(GPC), to penalize the Q-value appropriately and reduce the computational cost. The proposed method maps the state and action space to discrete space and constrains their Q-values through the pseudo-count. It is theoretically proved that only a few conditions are needed to obtain accurate uncertainty constraints in the proposed method. Moreover, we develop a Grid-Mapping Pseudo-Count Soft Actor-Critic(GPC-SAC) algorithm using GPC under the Soft Actor-Critic(SAC) framework to demonstrate the effectiveness of GPC. The experimental results on D4RL benchmark datasets show that GPC-SAC has better performance and less computational cost compared to other algorithms.
- [602] arXiv:2404.02548 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: AI-Tutoring in Software Engineering EducationComments: 11 pages, 5 figuresSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: With the rapid advancement of artificial intelligence (AI) in various domains, the education sector is set for transformation. The potential of AI-driven tools in enhancing the learning experience, especially in programming, is immense. However, the scientific evaluation of Large Language Models (LLMs) used in Automated Programming Assessment Systems (APASs) as an AI-Tutor remains largely unexplored. Therefore, there is a need to understand how students interact with such AI-Tutors and to analyze their experiences. In this paper, we conducted an exploratory case study by integrating the GPT-3.5-Turbo model as an AI-Tutor within the APAS Artemis. Through a combination of empirical data collection and an exploratory survey, we identified different user types based on their interaction patterns with the AI-Tutor. Additionally, the findings highlight advantages, such as timely feedback and scalability. However, challenges like generic responses and students' concerns about a learning progress inhibition when using the AI-Tutor were also evident. This research adds to the discourse on AI's role in education.
- [603] arXiv:2404.02552 (cross-list from astro-ph.SR) [ pdf , ps , html , other ]
-
Title: Solar synthetic imaging: Introducing denoising diffusion probabilistic models on SDO/AIA dataFrancesco P. Ramunno , S. Hackstein , V. Kinakh , M. Drozdova , G. Quetant , A. Csillaghy , S. VoloshynovskiyComments: 16 pages, 10 figures. Accepted for publication in Astronomy and Astrophysics (A&A)Subjects: Solar and Stellar Astrophysics (astro-ph.SR) ; Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Abstract: Given the rarity of significant solar flares compared to smaller ones, training effective machine learning models for solar activity forecasting is challenging due to insufficient data. This study proposes using generative deep learning models, specifically a Denoising Diffusion Probabilistic Model (DDPM), to create synthetic images of solar phenomena, including flares of varying intensities. By employing a dataset from the AIA instrument aboard the SDO spacecraft, focusing on the 171 Å band that captures various solar activities, and classifying images with GOES X-ray measurements based on flare intensity, we aim to address the data scarcity issue. The DDPM's performance is evaluated using cluster metrics, Frechet Inception Distance (FID), and F1-score, showcasing promising results in generating realistic solar imagery. We conduct two experiments: one to train a supervised classifier for event identification and another for basic flare prediction, demonstrating the value of synthetic data in managing imbalanced datasets. This research underscores the potential of DDPMs in solar data analysis and forecasting, suggesting further exploration into their capabilities for solar flare prediction and application in other deep learning and physical tasks.
- [604] arXiv:2404.02569 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: SliceIt! -- A Dual Simulator Framework for Learning Robot Food SlicingComments: Accepted to ICRA 2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Cooking robots can enhance the home experience by reducing the burden of daily chores. However, these robots must perform their tasks dexterously and safely in shared human environments, especially when handling dangerous tools such as kitchen knives. This study focuses on enabling a robot to autonomously and safely learn food-cutting tasks. More specifically, our goal is to enable a collaborative robot or industrial robot arm to perform food-slicing tasks by adapting to varying material properties using compliance control. Our approach involves using Reinforcement Learning (RL) to train a robot to compliantly manipulate a knife, by reducing the contact forces exerted by the food items and by the cutting board. However, training the robot in the real world can be inefficient, and dangerous, and result in a lot of food waste. Therefore, we proposed SliceIt!, a framework for safely and efficiently learning robot food-slicing tasks in simulation. Following a real2sim2real approach, our framework consists of collecting a few real food slicing data, calibrating our dual simulation environment (a high-fidelity cutting simulator and a robotic simulator), learning compliant control policies on the calibrated simulation environment, and finally, deploying the policies on the real robot.
- [605] arXiv:2404.02580 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Active learning for efficient annotation in precision agriculture: a use-case on crop-weed semantic segmentationBart M. van Marrewijk , Charbel Dandjinou , Dan Jeric Arcega Rustia , Nicolas Franco Gonzalez , Boubacar Diallo , Jérôme Dias , Paul Melki , Pieter M. BlokSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Optimizing deep learning models requires large amounts of annotated images, a process that is both time-intensive and costly. Especially for semantic segmentation models in which every pixel must be annotated. A potential strategy to mitigate annotation effort is active learning. Active learning facilitates the identification and selection of the most informative images from a large unlabelled pool. The underlying premise is that these selected images can improve the model's performance faster than random selection to reduce annotation effort. While active learning has demonstrated promising results on benchmark datasets like Cityscapes, its performance in the agricultural domain remains largely unexplored. This study addresses this research gap by conducting a comparative study of three active learning-based acquisition functions: Bayesian Active Learning by Disagreement (BALD), stochastic-based BALD (PowerBALD), and Random. The acquisition functions were tested on two agricultural datasets: Sugarbeet and Corn-Weed, both containing three semantic classes: background, crop and weed. Our results indicated that active learning, especially PowerBALD, yields a higher performance than Random sampling on both datasets. But due to the relatively large standard deviations, the differences observed were minimal; this was partly caused by high image redundancy and imbalanced classes. Specifically, more than 89\% of the pixels belonged to the background class on both datasets. The absence of significant results on both datasets indicates that further research is required for applying active learning on agricultural datasets, especially if they contain a high-class imbalance and redundant images. Recommendations and insights are provided in this paper to potentially resolve such issues.
- [606] arXiv:2404.02587 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: The Surprising Effectiveness of Rankers Trained on Expanded QueriesSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: An important problem in text-ranking systems is handling the hard queries that form the tail end of the query distribution. The difficulty may arise due to the presence of uncommon, underspecified, or incomplete queries. In this work, we improve the ranking performance of hard or difficult queries without compromising the performance of other queries. Firstly, we do LLM based query enrichment for training queries using relevant documents. Next, a specialized ranker is fine-tuned only on the enriched hard queries instead of the original queries. We combine the relevance scores from the specialized ranker and the base ranker, along with a query performance score estimated for each query. Our approach departs from existing methods that usually employ a single ranker for all queries, which is biased towards easy queries, which form the majority of the query distribution. In our extensive experiments on the DL-Hard dataset, we find that a principled query performance based scoring method using base and specialized ranker offers a significant improvement of up to 25% on the passage ranking task and up to 48.4% on the document ranking task when compared to the baseline performance of using original queries, even outperforming SOTA model.
- [607] arXiv:2404.02589 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Affective-NLI: Towards Accurate and Interpretable Personality Recognition in ConversationComments: Accepted by IEEE PerCom 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Personality Recognition in Conversation (PRC) aims to identify the personality traits of speakers through textual dialogue content. It is essential for providing personalized services in various applications of Human-Computer Interaction (HCI), such as AI-based mental therapy and companion robots for the elderly. Most recent studies analyze the dialog content for personality classification yet overlook two major concerns that hinder their performance. First, crucial implicit factors contained in conversation, such as emotions that reflect the speakers' personalities are ignored. Second, only focusing on the input dialog content disregards the semantic understanding of personality itself, which reduces the interpretability of the results. In this paper, we propose Affective Natural Language Inference (Affective-NLI) for accurate and interpretable PRC. To utilize affectivity within dialog content for accurate personality recognition, we fine-tuned a pre-trained language model specifically for emotion recognition in conversations, facilitating real-time affective annotations for utterances. For interpretability of recognition results, we formulate personality recognition as an NLI problem by determining whether the textual description of personality labels is entailed by the dialog content. Extensive experiments on two daily conversation datasets suggest that Affective-NLI significantly outperforms (by 6%-7%) state-of-the-art approaches. Additionally, our Flow experiment demonstrates that Affective-NLI can accurately recognize the speaker's personality in the early stages of conversations by surpassing state-of-the-art methods with 22%-34%.
- [608] arXiv:2404.02618 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Diffexplainer: Towards Cross-modal Global Explanations with Diffusion ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We present DiffExplainer, a novel framework that, leveraging language-vision models, enables multimodal global explainability. DiffExplainer employs diffusion models conditioned on optimized text prompts, synthesizing images that maximize class outputs and hidden features of a classifier, thus providing a visual tool for explaining decisions. Moreover, the analysis of generated visual descriptions allows for automatic identification of biases and spurious features, as opposed to traditional methods that often rely on manual intervention. The cross-modal transferability of language-vision models also enables the possibility to describe decisions in a more human-interpretable way, i.e., through text. We conduct comprehensive experiments, which include an extensive user study, demonstrating the effectiveness of DiffExplainer on 1) the generation of high-quality images explaining model decisions, surpassing existing activation maximization methods, and 2) the automated identification of biases and spurious features.
- [609] arXiv:2404.02625 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Differentiable Integer Linear Programming Solver for Explanation-Based Natural Language InferenceComments: Accepted to LREC-COLING 2024 - Camera Ready. arXiv admin note: substantial text overlap with arXiv:2208.03339Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Integer Linear Programming (ILP) has been proposed as a formalism for encoding precise structural and semantic constraints for Natural Language Inference (NLI). However, traditional ILP frameworks are non-differentiable, posing critical challenges for the integration of continuous language representations based on deep learning. In this paper, we introduce a novel approach, named Diff-Comb Explainer, a neuro-symbolic architecture for explanation-based NLI based on Differentiable BlackBox Combinatorial Solvers (DBCS). Differently from existing neuro-symbolic solvers, Diff-Comb Explainer does not necessitate a continuous relaxation of the semantic constraints, enabling a direct, more precise, and efficient incorporation of neural representations into the ILP formulation. Our experiments demonstrate that Diff-Comb Explainer achieves superior performance when compared to conventional ILP solvers, neuro-symbolic black-box solvers, and Transformer-based encoders. Moreover, a deeper analysis reveals that Diff-Comb Explainer can significantly improve the precision, consistency, and faithfulness of the constructed explanations, opening new opportunities for research on neuro-symbolic architectures for explainable and transparent NLI in complex domains.
- [610] arXiv:2404.02637 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Vocabulary Attack to Hijack Large Language Model ApplicationsComments: To be published in: Proc of the 14th International Conference on Cloud Computing, GRIDs, and Virtualization (Cloud Computing 2024), Venice, Italy, April 2024Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: The fast advancements in Large Language Models (LLMs) are driving an increasing number of applications. Together with the growing number of users, we also see an increasing number of attackers who try to outsmart these systems. They want the model to reveal confidential information, specific false information, or offensive behavior. To this end, they manipulate their instructions for the LLM by inserting separators or rephrasing them systematically until they reach their goal. Our approach is different. It inserts words from the model vocabulary. We find these words using an optimization procedure and embeddings from another LLM (attacker LLM). We prove our approach by goal hijacking two popular open-source LLMs from the Llama2 and the Flan-T5 families, respectively. We present two main findings. First, our approach creates inconspicuous instructions and therefore it is hard to detect. For many attack cases, we find that even a single word insertion is sufficient. Second, we demonstrate that we can conduct our attack using a different model than the target model to conduct our attack with.
- [611] arXiv:2404.02648 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: A Universal Deep Neural Network for Signal Detection in Wireless Communication SystemsSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Abstract: Recently, deep learning (DL) has been emerging as a promising approach for channel estimation and signal detection in wireless communications. The majority of the existing studies investigating the use of DL techniques in this domain focus on analysing channel impulse responses that are generated from only one channel distribution such as additive white Gaussian channel noise and Rayleigh channels. In practice, to cope with the dynamic nature of the wireless channel, DL methods must be re-trained on newly non-aged collected data which is costly, inefficient, and impractical. To tackle this challenge, this paper proposes a novel universal deep neural network (Uni-DNN) that can achieve high detection performance in various wireless environments without retraining the model. In particular, our proposed Uni-DNN model consists of a wireless channel classifier and a signal detector which are constructed by using DNNs. The wireless channel classifier enables the signal detector to generalise and perform optimally for multiple wireless channel distributions. In addition, to further improve the signal detection performance of the proposed model, convolutional neural network is employed. Extensive simulations using the orthogonal frequency division multiplexing scheme demonstrate that the bit error rate performance of our proposed solution can outperform conventional DL-based approaches as well as least square and minimum mean square error channel estimators in practical low pilot density scenarios.
- [612] arXiv:2404.02650 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards detecting unanticipated bias in Large Language ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Over the last year, Large Language Models (LLMs) like ChatGPT have become widely available and have exhibited fairness issues similar to those in previous machine learning systems. Current research is primarily focused on analyzing and quantifying these biases in training data and their impact on the decisions of these models, alongside developing mitigation strategies. This research largely targets well-known biases related to gender, race, ethnicity, and language. However, it is clear that LLMs are also affected by other, less obvious implicit biases. The complex and often opaque nature of these models makes detecting such biases challenging, yet this is crucial due to their potential negative impact in various applications. In this paper, we explore new avenues for detecting these unanticipated biases in LLMs, focusing specifically on Uncertainty Quantification and Explainable AI methods. These approaches aim to assess the certainty of model decisions and to make the internal decision-making processes of LLMs more transparent, thereby identifying and understanding biases that are not immediately apparent. Through this research, we aim to contribute to the development of fairer and more transparent AI systems.
- [613] arXiv:2404.02656 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Non-negative Subspace Feature Representation for Few-shot Learning in Medical ImagingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Unlike typical visual scene recognition domains, in which massive datasets are accessible to deep neural networks, medical image interpretations are often obstructed by the paucity of data. In this paper, we investigate the effectiveness of data-based few-shot learning in medical imaging by exploring different data attribute representations in a low-dimensional space. We introduce different types of non-negative matrix factorization (NMF) in few-shot learning, addressing the data scarcity issue in medical image classification. Extensive empirical studies are conducted in terms of validating the effectiveness of NMF, especially its supervised variants (e.g., discriminative NMF, and supervised and constrained NMF with sparseness), and the comparison with principal component analysis (PCA), i.e., the collaborative representation-based dimensionality reduction technique derived from eigenvectors. With 14 different datasets covering 11 distinct illness categories, thorough experimental results and comparison with related techniques demonstrate that NMF is a competitive alternative to PCA for few-shot learning in medical imaging, and the supervised NMF algorithms are more discriminative in the subspace with greater effectiveness. Furthermore, we show that the part-based representation of NMF, especially its supervised variants, is dramatically impactful in detecting lesion areas in medical imaging with limited samples.
- [614] arXiv:2404.02657 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language ModelsComments: Under review as a conference paper at COLM 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Kullback-Leiber divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable over the mean-seeking forward Kullback-Leibler (FKL) divergence, this study empirically and theoretically demonstrates that neither mode-seeking nor mean-seeking properties manifest in KD for LLMs. Instead, RKL and FKL are found to share the same optimization objective and both converge after a sufficient number of epochs. However, due to practical constraints, LLMs are seldom trained for such an extensive number of epochs. Meanwhile, we further find that RKL focuses on the tail part of the distributions, while FKL focuses on the head part at the beginning epochs. Consequently, we propose a simple yet effective Adaptive Kullback-Leiber (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations demonstrate that the proposed AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses.
- [615] arXiv:2404.02675 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Responsible Reporting for Frontier AI DevelopmentNoam Kolt , Markus Anderljung , Joslyn Barnhart , Asher Brass , Kevin Esvelt , Gillian K. Hadfield , Lennart Heim , Mikel Rodriguez , Jonas B. Sandbrink , Thomas WoodsideSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Mitigating the risks from frontier AI systems requires up-to-date and reliable information about those systems. Organizations that develop and deploy frontier systems have significant access to such information. By reporting safety-critical information to actors in government, industry, and civil society, these organizations could improve visibility into new and emerging risks posed by frontier systems. Equipped with this information, developers could make better informed decisions on risk management, while policymakers could design more targeted and robust regulatory infrastructure. We outline the key features of responsible reporting and propose mechanisms for implementing them in practice.
- [616] arXiv:2404.02681 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: PejorativITy: Disambiguating Pejorative Epithets to Improve Misogyny Detection in Italian TweetsArianna Muti , Federico Ruggeri , Cagri Toraman , Lorenzo Musetti , Samuel Algherini , Silvia Ronchi , Gianmarco Saretto , Caterina Zapparoli , Alberto Barrón-CedeñoSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Misogyny is often expressed through figurative language. Some neutral words can assume a negative connotation when functioning as pejorative epithets. Disambiguating the meaning of such terms might help the detection of misogyny. In order to address such task, we present PejorativITy, a novel corpus of 1,200 manually annotated Italian tweets for pejorative language at the word level and misogyny at the sentence level. We evaluate the impact of injecting information about disambiguated words into a model targeting misogyny detection. In particular, we explore two different approaches for injection: concatenation of pejorative information and substitution of ambiguous words with univocal terms. Our experimental results, both on our corpus and on two popular benchmarks on Italian tweets, show that both approaches lead to a major classification improvement, indicating that word sense disambiguation is a promising preliminary step for misogyny detection. Furthermore, we investigate LLMs' understanding of pejorative epithets by means of contextual word embeddings analysis and prompting.
- [617] arXiv:2404.02684 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Cross-Architecture Transfer Learning for Linear-Cost Inference TransformersComments: PreprintSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recently, multiple architectures has been proposed to improve the efficiency of the Transformer Language Models through changing the design of the self-attention block to have a linear-cost inference (LCI). A notable approach in this realm is the State-Space Machines (SSMs) architecture, which showed on-par performance on language modeling tasks with the self-attention transformers. However, such an architectural change requires a full pretraining of the weights from scratch, which incurs a huge cost to researchers and practitioners who want to use the new architectures. In the more traditional linear attention works, it has been proposed to approximate full attention with linear attention by swap-and-finetune framework. Motivated by this approach, we propose Cross-Architecture Transfer Learning (XATL), in which the weights of the shared components between LCI and self-attention-based transformers, such as layernorms, MLPs, input/output embeddings, are directly transferred to the new architecture from already pre-trained model parameters. We experimented the efficacy of the method on varying sizes and alternative attention architectures and show that \methodabbr significantly reduces the training time up to 2.5x times and converges to a better minimum with up to 2.6% stronger model on the LM benchmarks within the same compute budget.
- [618] arXiv:2404.02690 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Attention is Naturally Sparse with Gaussian Distributed InputSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The computational intensity of Large Language Models (LLMs) is a critical bottleneck, primarily due to the $O(n^2)$ complexity of the attention mechanism in transformer architectures. Addressing this, sparse attention emerges as a key innovation, aiming to reduce computational load while maintaining model performance. This study presents a rigorous theoretical analysis of the sparsity in attention scores within LLMs, particularly under the framework of Gaussian inputs. By establishing a set of foundational assumptions and employing a methodical theoretical approach, we unravel the intrinsic characteristics of attention score sparsity and its implications on computational efficiency. Our main contribution lies in providing a detailed theoretical examination of how sparsity manifests in attention mechanisms, offering insights into the potential trade-offs between computational savings and model effectiveness. This work not only advances our understanding of sparse attention but also provides a scaffold for future research in optimizing the computational frameworks of LLMs, paving the way for more scalable and efficient AI systems.
- [619] arXiv:2404.02702 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt EncodersComments: 7Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI)
Abstract: Neural speech codec has recently gained widespread attention in generative speech modeling domains, like voice conversion, text-to-speech synthesis, etc. However, ensuring high-fidelity audio reconstruction of speech codecs under low bitrate remains an open and challenging issue. In this paper, we propose PromptCodec, a novel end-to-end neural speech codec using feature-aware prompt encoders based on disentangled representation learning. By incorporating prompt encoders to capture representations of additional input prompts, PromptCodec can distribute the speech information requiring processing and enhance its representation capabilities. Moreover, a simple yet effective adaptive feature weighted fusion approach is introduced to integrate features of different encoders. Meanwhile, we propose a novel disentangled representation learning strategy based on structure similarity index measure to optimize PromptCodec's encoders to ensure their efficiency, thereby further improving the performance of PromptCodec. Experiments on LibriTTS demonstrate that our proposed PromptCodec consistently outperforms state-of-the-art neural speech codec models under all different bitrate conditions while achieving superior performance with low bitrates.
- [620] arXiv:2404.02719 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Can We Understand Plasticity Through Neural Collapse?Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This paper explores the connection between two recently identified phenomena in deep learning: plasticity loss and neural collapse. We analyze their correlation in different scenarios, revealing a significant association during the initial training phase on the first task. Additionally, we introduce a regularization approach to mitigate neural collapse, demonstrating its effectiveness in alleviating plasticity loss in this specific setting.
- [621] arXiv:2404.02728 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Unsupervised Learning of Effective Actions in RoboticsComments: Accepted at The First Austrian Symposium on AI, Robotics, and Vision (AIROV24)Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Learning actions that are relevant to decision-making and can be executed effectively is a key problem in autonomous robotics. Current state-of-the-art action representations in robotics lack proper effect-driven learning of the robot's actions. Although successful in solving manipulation tasks, deep learning methods also lack this ability, in addition to their high cost in terms of memory or training data. In this paper, we propose an unsupervised algorithm to discretize a continuous motion space and generate "action prototypes", each producing different effects in the environment. After an exploration phase, the algorithm automatically builds a representation of the effects and groups motions into action prototypes, where motions more likely to produce an effect are represented more than those that lead to negligible changes. We evaluate our method on a simulated stair-climbing reinforcement learning task, and the preliminary results show that our effect driven discretization outperforms uniformly and randomly sampled discretizations in convergence speed and maximum reward.
- [622] arXiv:2404.02729 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Learning Sequence Attractors in Recurrent Networks with Hidden NeuronsSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The brain is targeted for processing temporal sequence information. It remains largely unclear how the brain learns to store and retrieve sequence memories. Here, we study how recurrent networks of binary neurons learn sequence attractors to store predefined pattern sequences and retrieve them robustly. We show that to store arbitrary pattern sequences, it is necessary for the network to include hidden neurons even though their role in displaying sequence memories is indirect. We develop a local learning algorithm to learn sequence attractors in the networks with hidden neurons. The algorithm is proven to converge and lead to sequence attractors. We demonstrate that the network model can store and retrieve sequences robustly on synthetic and real-world datasets. We hope that this study provides new insights in understanding sequence memory and temporal information processing in the brain.
- [623] arXiv:2404.02755 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online RefinementComments: Accepted by CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC), that elaborates on improving the quality of the generated event captions and their associated pseudo event boundaries from unlabeled videos. By leveraging the capabilities of diverse large language models (LLMs), we generate rich DVC-oriented caption candidates and optimize the corresponding pseudo boundaries under several meticulously designed objectives, considering diversity, event-centricity, temporal ordering, and coherence. Moreover, we further introduce a novel online boundary refinement strategy that iteratively improves the quality of pseudo boundaries during training. Comprehensive experiments have been conducted to examine the effectiveness of the proposed technique components. By leveraging a substantial amount of unlabeled video data, such as HowTo100M, we achieve a remarkable advancement on standard DVC datasets like YouCook2 and ActivityNet. We outperform the previous state-of-the-art Vid2Seq across a majority of metrics, achieving this with just 0.4% of the unlabeled video data used for pre-training by Vid2Seq.
- [624] arXiv:2404.02759 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Unsupervised Occupancy Learning from Sparse Point CloudComments: CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Abstract: Implicit Neural Representations have gained prominence as a powerful framework for capturing complex data modalities, encompassing a wide range from 3D shapes to images and audio. Within the realm of 3D shape representation, Neural Signed Distance Functions (SDF) have demonstrated remarkable potential in faithfully encoding intricate shape geometry. However, learning SDFs from 3D point clouds in the absence of ground truth supervision remains a very challenging task. In this paper, we propose a method to infer occupancy fields instead of SDFs as they are easier to learn from sparse inputs. We leverage a margin-based uncertainty measure to differentially sample from the decision boundary of the occupancy function and supervise the sampled boundary points using the input point cloud. We further stabilize the optimization process at the early stages of the training by biasing the occupancy function towards minimal entropy fields while maximizing its entropy at the input point cloud. Through extensive experiments and evaluations, we illustrate the efficacy of our proposed method, highlighting its capacity to improve implicit shape inference with respect to baselines and the state-of-the-art using synthetic and real data.
- [625] arXiv:2404.02761 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: AQuA -- Combining Experts' and Non-Experts' Views To Assess Deliberation Quality in Online Discussions Using LLMsMaike Behrendt , Stefan Sylvius Wagner , Marc Ziegele , Lena Wilms , Anke Stoll , Dominique Heinbach , Stefan HarmelingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Measuring the quality of contributions in political online discussions is crucial in deliberation research and computer science. Research has identified various indicators to assess online discussion quality, and with deep learning advancements, automating these measures has become feasible. While some studies focus on analyzing specific quality indicators, a comprehensive quality score incorporating various deliberative aspects is often preferred. In this work, we introduce AQuA, an additive score that calculates a unified deliberative quality score from multiple indices for each discussion post. Unlike other singular scores, AQuA preserves information on the deliberative aspects present in comments, enhancing model transparency. We develop adapter models for 20 deliberative indices, and calculate correlation coefficients between experts' annotations and the perceived deliberativeness by non-experts to weigh the individual indices into a single deliberative score. We demonstrate that the AQuA score can be computed easily from pre-trained adapters and aligns well with annotations on other datasets that have not be seen during training. The analysis of experts' vs. non-experts' annotations confirms theoretical findings in the social science literature.
- [626] arXiv:2404.02785 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Domain Generalization through Meta-Learning: A SurveySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Abstract: Deep neural networks (DNNs) have revolutionized artificial intelligence but often lack performance when faced with out-of-distribution (OOD) data, a common scenario due to the inevitable domain shifts in real-world applications. This limitation stems from the common assumption that training and testing data share the same distribution-an assumption frequently violated in practice. Despite their effectiveness with large amounts of data and computational power, DNNs struggle with distributional shifts and limited labeled data, leading to overfitting and poor generalization across various tasks and domains. Meta-learning presents a promising approach by employing algorithms that acquire transferable knowledge across various tasks for fast adaptation, eliminating the need to learn each task from scratch. This survey paper delves into the realm of meta-learning with a focus on its contribution to domain generalization. We first clarify the concept of meta-learning for domain generalization and introduce a novel taxonomy based on the feature extraction strategy and the classifier learning methodology, offering a granular view of methodologies. Through an exhaustive review of existing methods and underlying theories, we map out the fundamentals of the field. Our survey provides practical insights and an informed discussion on promising research directions, paving the way for future innovation in meta-learning for domain generalization.
- [627] arXiv:2404.02800 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: On Few-Shot Prompting for Controllable Question-Answer Generation in Narrative ComprehensionComments: Preprint - Accepted for publication at CSEDU 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Question Generation aims to automatically generate questions based on a given input provided as context. A controllable question generation scheme focuses on generating questions with specific attributes, allowing better control. In this study, we propose a few-shot prompting strategy for controlling the generation of question-answer pairs from children's narrative texts. We aim to control two attributes: the question's explicitness and underlying narrative elements. With empirical evaluation, we show the effectiveness of controlling the generation process by employing few-shot prompting side by side with a reference model. Our experiments highlight instances where the few-shot strategy surpasses the reference model, particularly in scenarios such as semantic closeness evaluation and the diversity and coherency of question-answer pairs. However, these improvements are not always statistically significant. The code is publicly available at this http URL .
- [628] arXiv:2404.02806 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: The RealHumanEval: Evaluating Large Language Models' Abilities to Support ProgrammersHussein Mozannar , Valerie Chen , Mohammed Alsobay , Subhro Das , Sebastian Zhao , Dennis Wei , Manish Nagireddy , Prasanna Sattigeri , Ameet Talwalkar , David SontagSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Evaluation of large language models (LLMs) for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), which measure the ability of LLMs to generate complete code that passes unit tests. As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks translate to gains in programmer productivity when coding with LLMs, including time spent coding. In addition to static benchmarks, we investigate the utility of preference metrics that might be used as proxies to measure LLM helpfulness, such as code acceptance or copy rates. To do so, we introduce RealHumanEval, a web interface to measure the ability of LLMs to assist programmers, through either autocomplete or chat support. We conducted a user study (N=213) using RealHumanEval in which users interacted with six LLMs of varying base model performance. Despite static benchmarks not incorporating humans-in-the-loop, we find that improvements in benchmark performance lead to increased programmer productivity; however gaps in benchmark versus human performance are not proportional -- a trend that holds across both forms of LLM support. In contrast, we find that programmer preferences do not correlate with their actual performance, motivating the need for better, human-centric proxy signals. We also open-source RealHumanEval to enable human-centric evaluation of new models and the study data to facilitate efforts to improve code models.
- [629] arXiv:2404.02807 (cross-list from physics.med-ph) [ pdf , ps , html , other ]
-
Title: An Optimization Framework to Personalize Passive Cardiac MechanicsSubjects: Medical Physics (physics.med-ph) ; Artificial Intelligence (cs.AI)
Abstract: Personalized cardiac mechanics modeling is a powerful tool for understanding the biomechanics of cardiac function in health and disease and assisting in treatment planning. However, current models are limited to using medical images acquired at a single cardiac phase, often limiting their applicability for processing dynamic image acquisitions. This study introduces an inverse finite element analysis (iFEA) framework to estimate the passive mechanical properties of cardiac tissue using time-dependent medical image data. The iFEA framework relies on a novel nested optimization scheme, in which the outer iterations utilize a traditional optimization method to best approximate material parameters that fit image data, while the inner iterations employ an augmented Sellier's algorithm to estimate the stress-free reference configuration. With a focus on characterizing the passive mechanical behavior, the framework employs structurally based anisotropic hyperelastic constitutive models and physiologically relevant boundary conditions to simulate myocardial mechanics. We use a stabilized variational multiscale formulation for solving the governing nonlinear elastodynamics equations, verified for cardiac mechanics applications. The framework is tested in myocardium models of biventricle and left atrium derived from cardiac phase-resolved computed tomographic (CT) images of a healthy subject and three patients with hypertrophic obstructive cardiomyopathy (HOCM). The impact of the choice of optimization methods and other numerical settings, including fiber direction parameters, mesh size, initial parameters for optimization, and perturbations to optimal material parameters, is assessed using a rigorous sensitivity analysis. The performance of the current iFEA is compared against an assumed power-law-based pressure-volume relation, typically used for single-phase image acquisition.
- [630] arXiv:2404.02817 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: A Survey of Optimization-based Task and Motion Planning: From Classical To Learning ApproachesComments: 24 pages, 12 figures, submitted for reviewSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Task and Motion Planning (TAMP) integrates high-level task planning and low-level motion planning to equip robots with the autonomy to effectively reason over long-horizon, dynamic tasks. Optimization-based TAMP focuses on hybrid optimization approaches that define goal conditions via objective functions and are capable of handling open-ended goals, robotic dynamics, and physical interaction between the robot and the environment. Therefore, optimization-based TAMP is particularly suited to solve highly complex, contact-rich locomotion and manipulation problems. This survey provides a comprehensive review on optimization-based TAMP, covering (i) planning domain representations, including action description languages and temporal logic, (ii) individual solution strategies for components of TAMP, including AI planning and trajectory optimization (TO), and (iii) the dynamic interplay between logic-based task planning and model-based TO. A particular focus of this survey is to highlight the algorithm structures to efficiently solve TAMP, especially hierarchical and distributed approaches. Additionally, the survey emphasizes the synergy between the classical methods and contemporary learning-based innovations such as large language models. Furthermore, the future research directions for TAMP is discussed in this survey, highlighting both algorithmic and application-specific challenges.
- [631] arXiv:2404.02823 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The ability of large language models (LLMs) to follow instructions is crucial to real-world applications. Despite recent advances, several studies have highlighted that LLMs struggle when faced with challenging instructions, especially those that include complex constraints, hindering their effectiveness in various tasks. To address this challenge, we introduce Conifer, a novel instruction tuning dataset, designed to enhance LLMs to follow multi-level instructions with complex constraints. Utilizing GPT-4, we curate the dataset by a series of LLM-driven refinement processes to ensure high quality. We also propose a progressive learning scheme that emphasizes an easy-to-hard progression, and learning from process feedback. Models trained with Conifer exhibit remarkable improvements in instruction-following abilities, especially for instructions with complex constraints. On several instruction-following benchmarks, our 7B model outperforms the state-of-the-art open-source 7B models, even exceeds the performance of models 10 times larger on certain metrics. All the code and Conifer dataset are available at this https URL .
- [632] arXiv:2404.02830 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Enhancing Interpretability of Vertebrae Fracture Grading using Human-interpretable PrototypesPoulami Sinhamahapatra , Suprosanna Shit , Anjany Sekuboyina , Malek Husseini , David Schinz , Nicolas Lenhart , Joern Menze , Jan Kirschke , Karsten Roscher , Stephan GuennemannSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Vertebral fracture grading classifies the severity of vertebral fractures, which is a challenging task in medical imaging and has recently attracted Deep Learning (DL) models. Only a few works attempted to make such models human-interpretable despite the need for transparency and trustworthiness in critical use cases like DL-assisted medical diagnosis. Moreover, such models either rely on post-hoc methods or additional annotations. In this work, we propose a novel interpretable-by-design method, ProtoVerse, to find relevant sub-parts of vertebral fractures (prototypes) that reliably explain the model's decision in a human-understandable way. Specifically, we introduce a novel diversity-promoting loss to mitigate prototype repetitions in small datasets with intricate semantics. We have experimented with the VerSe'19 dataset and outperformed the existing prototype-based method. Further, our model provides superior interpretability against the post-hoc method. Importantly, expert radiologists validated the visual interpretability of our results, showing clinical applicability.
- [633] arXiv:2404.02869 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Human Activity Recognition using SmartphonesMayur Sonawane , Sahil Rajesh Dhayalkar , Siddesh Waje , Soyal Markhelkar , Akshay Wattamwar , Seema C. ShrawneJournal-ref: International Journal of Engineering Science and Computing, October 2018Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Human Activity Recognition is a subject of great research today and has its applications in remote healthcare, activity tracking of the elderly or the disables, calories burnt tracking etc. In our project, we have created an Android application that recognizes the daily human activities and calculate the calories burnt in real time. We first captured labeled triaxial acceleration readings for different daily human activities from the smartphone's embedded accelerometer. These readings were preprocessed using a median filter. 42 features were extracted using various methods. We then tested various machine learning algorithms along with dimensionality reduction. Finally, in our Android application, we used the machine learning algorithm and a subset of features that provided maximum accuracy and minimum model building time. This is used for real-time activity recognition and calculation of calories burnt using a formula based on Metabolic Equivalent.
- [634] arXiv:2404.02877 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: FlightScope: A Deep Comprehensive Assessment of Aircraft Detection Algorithms in Satellite ImagerySafouane El Ghazouali , Arnaud Gucciardi , Nicola Venturi , Michael Rueegsegger , Umberto MichelucciComments: 15 figures, 4 tables, comprehensive survey, comparative studySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Object detection in remotely sensed satellite pictures is fundamental in many fields such as biophysical, and environmental monitoring. While deep learning algorithms are constantly evolving, they have been mostly implemented and tested on popular ground-based taken photos. This paper critically evaluates and compares a suite of advanced object detection algorithms customized for the task of identifying aircraft within satellite imagery. Using the large HRPlanesV2 dataset, together with a rigorous validation with the GDIT dataset, this research encompasses an array of methodologies including YOLO versions 5 and 8, Faster RCNN, CenterNet, RetinaNet, RTMDet, and DETR, all trained from scratch. This exhaustive training and validation study reveal YOLOv5 as the preeminent model for the specific case of identifying airplanes from remote sensing data, showcasing high precision and adaptability across diverse imaging conditions. This research highlight the nuanced performance landscapes of these algorithms, with YOLOv5 emerging as a robust solution for aerial object detection, underlining its importance through superior mean average precision, Recall, and Intersection over Union scores. The findings described here underscore the fundamental role of algorithm selection aligned with the specific demands of satellite imagery analysis and extend a comprehensive framework to evaluate model efficacy. The benchmark toolkit and codes, available via this https URL , aims to further exploration and innovation in the realm of remote sensing object detection, paving the way for improved analytical methodologies in satellite imagery applications.
- [635] arXiv:2404.02883 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: On the Scalability of Diffusion-based Text-to-Image GenerationHao Li , Yang Zou , Ying Wang , Orchid Majumder , Yusheng Xie , R. Manmatha , Ashwin Swaminathan , Zhuowen Tu , Stefano Ermon , Stefano SoattoComments: CVPR2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Scaling up model and data size has been quite successful for the evolution of LLMs. However, the scaling law for the diffusion based text-to-image (T2I) models is not fully explored. It is also unclear how to efficiently scale the model for better performance at reduced cost. The different training settings and expensive training cost make a fair model comparison extremely difficult. In this work, we empirically study the scaling properties of diffusion based T2I models by performing extensive and rigours ablations on scaling both denoising backbones and training set, including training scaled UNet and Transformer variants ranging from 0.4B to 4B parameters on datasets upto 600M images. For model scaling, we find the location and amount of cross attention distinguishes the performance of existing UNet designs. And increasing the transformer blocks is more parameter-efficient for improving text-image alignment than increasing channel numbers. We then identify an efficient UNet variant, which is 45% smaller and 28% faster than SDXL's UNet. On the data scaling side, we show the quality and diversity of the training set matters more than simply dataset size. Increasing caption density and diversity improves text-image alignment performance and the learning efficiency. Finally, we provide scaling functions to predict the text-image alignment performance as functions of the scale of model size, compute and dataset size.
- [636] arXiv:2404.02900 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed DatasetsComments: CVPR 2024. Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, we divide the input image into patch tokens and process them through a stack of self attention blocks. However, unlike Convolutional Neural Networks (CNN), ViTs simple architecture has no informative inductive bias (e.g., locality,etc. ). Due to this, ViT requires a large amount of data for pre-training. Various data efficient approaches (DeiT) have been proposed to train ViT on balanced datasets effectively. However, limited literature discusses the use of ViT for datasets with long-tailed imbalances. In this work, we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an efficient and effective way of distillation from CNN via distillation DIST token by using out-of-distribution images and re-weighting the distillation loss to enhance focus on tail classes. This leads to the learning of local CNN-like features in early ViT blocks, improving generalization for tail classes. Further, to mitigate overfitting, we propose distilling from a flat CNN teacher, which leads to learning low-rank generalizable features for DIST tokens across all ViT blocks. With the proposed DeiT-LT scheme, the distillation DIST token becomes an expert on the tail classes, and the classifier CLS token becomes an expert on the head classes. The experts help to effectively learn features corresponding to both the majority and minority classes using a distinct set of tokens within the same ViT architecture. We show the effectiveness of DeiT-LT for training ViT from scratch on datasets ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018.
- [637] arXiv:2404.02904 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ALOHa: A New Measure for Hallucination in Captioning ModelsSuzanne Petryk , David M. Chan , Anish Kachinthaya , Haodi Zou , John Canny , Joseph E. Gonzalez , Trevor DarrellComments: To appear at NAACL 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Despite recent advances in multimodal pre-training for visual description, state-of-the-art models still produce captions containing errors, such as hallucinating objects not present in a scene. The existing prominent metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms. In this work, we propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations. Specifically, we use an LLM to extract groundable objects from a candidate caption, measure their semantic similarity to reference objects from captions and object detections, and use Hungarian matching to produce a final hallucination score. We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations, and 30.8% more on nocaps, where objects extend beyond MS COCO categories. Our code is available at this https URL .
- [638] arXiv:2404.02905 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale PredictionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes AR models surpass diffusion transformers in image generation. On ImageNet 256x256 benchmark, VAR significantly improve AR baseline by improving Frechet inception distance (FID) from 18.65 to 1.80, inception score (IS) from 80.4 to 356.4, with around 20x faster inference speed. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated the two important properties of LLMs: Scaling Laws and zero-shot task generalization. We have released all models and codes to promote the exploration of AR/VAR models for visual generation and unified learning.
- [639] arXiv:2404.02912 (cross-list from cs.CC) [ pdf , ps , html , other ]
-
Title: Probabilistic Generating Circuits -- DemystifiedSubjects: Computational Complexity (cs.CC) ; Artificial Intelligence (cs.AI)
Abstract: Zhang et al. (ICML 2021, PLMR 139, pp. 12447-1245) introduced probabilistic generating circuits (PGCs) as a probabilistic model to unify probabilistic circuits (PCs) and determinantal point processes (DPPs). At a first glance, PGCs store a distribution in a very different way, they compute the probability generating polynomial instead of the probability mass function and it seems that this is the main reason why PGCs are more powerful than PCs or DPPs. However, PGCs also allow for negative weights, whereas classical PCs assume that all weights are nonnegative. One of the main insights of our paper is that the negative weights are responsible for the power of PGCs and not the different representation. PGCs are PCs in disguise, in particular, we show how to transform any PGC into a PC with negative weights with only polynomial blowup.
PGCs were defined by Zhang et al. only for binary random variables. As our second main result, we show that there is a good reason for this: we prove that PGCs for categorial variables with larger image size do not support tractable marginalization unless NP = P. On the other hand, we show that we can model categorial variables with larger image size as PC with negative weights computing set-multilinear polynomials. These allow for tractable marginalization. In this sense, PCs with negative weights strictly subsume PGCs. - [640] arXiv:2404.02923 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: An Unsupervised Adversarial Autoencoder for Cyber Attack Detection in Power Distribution GridsSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract: Detection of cyber attacks in smart power distribution grids with unbalanced configurations poses challenges due to the inherent nonlinear nature of these uncertain and stochastic systems. It originates from the intermittent characteristics of the distributed energy resources (DERs) generation and load variations. Moreover, the unknown behavior of cyber attacks, especially false data injection attacks (FDIAs) in the distribution grids with complex temporal correlations and the limited amount of labeled data increases the vulnerability of the grids and imposes a high risk in the secure and reliable operation of the grids. To address these challenges, this paper proposes an unsupervised adversarial autoencoder (AAE) model to detect FDIAs in unbalanced power distribution grids integrated with DERs, i.e., PV systems and wind generation. The proposed method utilizes long short-term memory (LSTM) in the structure of the autoencoder to capture the temporal dependencies in the time-series measurements and leverages the power of generative adversarial networks (GANs) for better reconstruction of the input data. The advantage of the proposed data-driven model is that it can detect anomalous points for the system operation without reliance on abstract models or mathematical representations. To evaluate the efficacy of the approach, it is tested on IEEE 13-bus and 123-bus systems with historical meteorological data (wind speed, ambient temperature, and solar irradiance) as well as historical real-world load data under three types of data falsification functions. The comparison of the detection results of the proposed model with other unsupervised learning methods verifies its superior performance in detecting cyber attacks in unbalanced power distribution grids.
- [641] arXiv:2404.02928 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion ModelsSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: The fast advance of the image generation community has attracted attention worldwide. The safety issue needs to be further scrutinized and studied. There have been a few works around this area mostly achieving a post-processing design, model-specific, or yielding suboptimal image quality generation. Despite that, in this article, we discover a black-box attack method that enjoys three merits. It enables (i)-attacks both directed and semantic-driven that theoretically and practically pose a hazard to this vast user community, (ii)-surprisingly surpasses the white-box attack in a black-box manner and (iii)-without requiring any post-processing effort. Core to our approach is inspired by the concept guidance intriguing property of Classifier-Free guidance (CFG) in T2I models, and we discover that conducting frustratingly simple guidance in the CLIP embedding space, coupled with the semantic loss and an additionally sensitive word list works very well. Moreover, our results expose and highlight the vulnerabilities in existing defense mechanisms.
- [642] arXiv:2404.02929 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Using Large Language Models to Understand Telecom StandardsAthanasios Karapantelakis , Mukesh Thakur , Alexandros Nikou , Farnaz Moradi , Christian Orlog , Fitsum Gaim , Henrik Holm , Doumitrou Daniil Nimara , Vincent HuangComments: Accepted to ICMLCN 2024, Stockholm, May 2024. Updating typo in authors listSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The Third Generation Partnership Project (3GPP) has successfully introduced standards for global mobility. However, the volume and complexity of these standards has increased over time, thus complicating access to relevant information for vendors and service providers. Use of Generative Artificial Intelligence (AI) and in particular Large Language Models (LLMs), may provide faster access to relevant information. In this paper, we evaluate the capability of state-of-art LLMs to be used as Question Answering (QA) assistants for 3GPP document reference. Our contribution is threefold. First, we provide a benchmark and measuring methods for evaluating performance of LLMs. Second, we do data preprocessing and fine-tuning for one of these LLMs and provide guidelines to increase accuracy of the responses that apply to all LLMs. Third, we provide a model of our own, TeleRoBERTa, that performs on-par with foundation LLMs but with an order of magnitude less number of parameters. Results show that LLMs can be used as a credible reference tool on telecom technical documents, and thus have potential for a number of different applications from troubleshooting and maintenance, to network operations and software product development.
- [643] arXiv:2404.02931 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: READ: Improving Relation Extraction from an ADversarial PerspectiveComments: Accepted by findings of NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recent works in relation extraction (RE) have achieved promising benchmark accuracy; however, our adversarial attack experiments show that these works excessively rely on entities, making their generalization capability questionable. To address this issue, we propose an adversarial training method specifically designed for RE. Our approach introduces both sequence- and token-level perturbations to the sample and uses a separate perturbation vocabulary to improve the search for entity and context perturbations. Furthermore, we introduce a probabilistic strategy for leaving clean tokens in the context during adversarial training. This strategy enables a larger attack budget for entities and coaxes the model to leverage relational patterns embedded in the context. Extensive experiments show that compared to various adversarial training methods, our method significantly improves both the accuracy and robustness of the model. Additionally, experiments on different data availability settings highlight the effectiveness of our method in low-resource scenarios. We also perform in-depth analyses of our proposed method and provide further hints. We will release our code at this https URL .
- [644] arXiv:2404.02933 (cross-list from cs.DB) [ pdf , ps , html , other ]
-
Title: NL2KQL: From Natural Language to Kusto QueryAmir H. Abdi , Xinye Tang , Jeremias Eichelbaum , Mahan Das , Alex Klein , Nihal Irmak Pakis , William Blum , Daniel L Mace , Tanvi Raja , Namrata Padmanabhan , Ye XingSubjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Data is growing rapidly in volume and complexity. Proficiency in database query languages is pivotal for crafting effective queries. As coding assistants become more prevalent, there is significant opportunity to enhance database query languages. The Kusto Query Language (KQL) is a widely used query language for large semi-structured data such as logs, telemetries, and time-series for big data analytics platforms. This paper introduces NL2KQL an innovative framework that uses large language models (LLMs) to convert natural language queries (NLQs) to KQL queries. The proposed NL2KQL framework includes several key components: Schema Refiner which narrows down the schema to its most pertinent elements; the Few-shot Selector which dynamically selects relevant examples from a few-shot dataset; and the Query Refiner which repairs syntactic and semantic errors in KQL queries. Additionally, this study outlines a method for generating large datasets of synthetic NLQ-KQL pairs which are valid within a specific database contexts. To validate NL2KQL's performance, we utilize an array of online (based on query execution) and offline (based on query parsing) metrics. Through ablation studies, the significance of each framework component is examined, and the datasets used for benchmarking are made publicly available. This work is the first of its kind and is compared with available baselines to demonstrate its effectiveness.
- [645] arXiv:2404.02934 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: GreedLlama: Performance of Financial Value-Aligned Large Language Models in Moral ReasoningComments: 9 pages, 1 figureSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: This paper investigates the ethical implications of aligning Large Language Models (LLMs) with financial optimization, through the case study of GreedLlama, a model fine-tuned to prioritize economically beneficial outcomes. By comparing GreedLlama's performance in moral reasoning tasks to a base Llama2 model, our results highlight a concerning trend: GreedLlama demonstrates a marked preference for profit over ethical considerations, making morally appropriate decisions at significantly lower rates than the base model in scenarios of both low and high moral ambiguity. In low ambiguity situations, GreedLlama's ethical decisions decreased to 54.4%, compared to the base model's 86.9%, while in high ambiguity contexts, the rate was 47.4% against the base model's 65.1%. These findings emphasize the risks of single-dimensional value alignment in LLMs, underscoring the need for integrating broader ethical values into AI development to ensure decisions are not solely driven by financial incentives. The study calls for a balanced approach to LLM deployment, advocating for the incorporation of ethical considerations in models intended for business applications, particularly in light of the absence of regulatory oversight.
- [646] arXiv:2404.02935 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: KnowHalu: Hallucination Detection via Multi-Form Knowledge Based Factual CheckingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper introduces KnowHalu, a novel approach for detecting hallucinations in text generated by large language models (LLMs), utilizing step-wise reasoning, multi-formulation query, multi-form knowledge for factual checking, and fusion-based detection mechanism. As LLMs are increasingly applied across various domains, ensuring that their outputs are not hallucinated is critical. Recognizing the limitations of existing approaches that either rely on the self-consistency check of LLMs or perform post-hoc fact-checking without considering the complexity of queries or the form of knowledge, KnowHalu proposes a two-phase process for hallucination detection. In the first phase, it identifies non-fabrication hallucinations--responses that, while factually correct, are irrelevant or non-specific to the query. The second phase, multi-form based factual checking, contains five key steps: reasoning and query decomposition, knowledge retrieval, knowledge optimization, judgment generation, and judgment aggregation. Our extensive evaluations demonstrate that KnowHalu significantly outperforms SOTA baselines in detecting hallucinations across diverse tasks, e.g., improving by 15.65% in QA tasks and 5.50% in summarization tasks, highlighting its effectiveness and versatility in detecting hallucinations in LLM-generated content.
- [647] arXiv:2404.02937 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Responsible and Reliable Traffic Flow Prediction with Large Language ModelsComments: 27pages, 8 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Traffic forecasting is crucial for intelligent transportation systems. It has experienced significant advancements thanks to the power of deep learning in capturing latent patterns of traffic data. However, recent deep-learning architectures require intricate model designs and lack an intuitive understanding of the mapping from input data to predicted results. Achieving both accuracy and responsibility in traffic prediction models remains a challenge due to the complexity of traffic data and the inherent opacity of deep learning models. To tackle these challenges, we propose a Responsible and Reliable Traffic flow forecasting model with Large Language Models (R2T-LLM), which leverages large language models (LLMs) to generate responsible traffic predictions. By transferring multi-modal traffic data into natural language descriptions, R2T-LLM captures complex spatial-temporal patterns and external factors from comprehensive traffic data. The LLM framework is fine-tuned using language-based instructions to align with spatial-temporal traffic flow data. Empirically, R2T-LLM shows competitive accuracy compared with deep learning baselines, while providing an intuitive and reliable explanation for predictions. We discuss the spatial-temporal and input dependencies for conditional future flow forecasting, showcasing R2T-LLM's potential for diverse city prediction tasks. This paper contributes to advancing accountable traffic prediction models and lays a foundation for future exploration of LLM applications in transportation. To the best of our knowledge, this is the first study to use LLM for accountable and reliable prediction of traffic flows.
- [648] arXiv:2404.02942 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Decision Predicate Graphs: Enhancing Interpretability in Tree EnsemblesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Understanding the decisions of tree-based ensembles and their relationships is pivotal for machine learning model interpretation. Recent attempts to mitigate the human-in-the-loop interpretation challenge have explored the extraction of the decision structure underlying the model taking advantage of graph simplification and path emphasis. However, while these efforts enhance the visualisation experience, they may either result in a visually complex representation or compromise the interpretability of the original ensemble model. In addressing this challenge, especially in complex scenarios, we introduce the Decision Predicate Graph (DPG) as a model-agnostic tool to provide a global interpretation of the model. DPG is a graph structure that captures the tree-based ensemble model and learned dataset details, preserving the relations among features, logical decisions, and predictions towards emphasising insightful points. Leveraging well-known graph theory concepts, such as the notions of centrality and community, DPG offers additional quantitative insights into the model, complementing visualisation techniques, expanding the problem space descriptions, and offering diverse possibilities for extensions. Empirical experiments demonstrate the potential of DPG in addressing traditional benchmarks and complex classification scenarios.
- [649] arXiv:2404.02943 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Learning in Convolutional Neural Networks Accelerated by Transfer EntropyJournal-ref: Entropy - MDPI, Year 2021, Number 9, Article Number 1218, PubMedID 34573843, ISSN 1099-4300Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Abstract: Recently, there is a growing interest in applying Transfer Entropy (TE) in quantifying the effective connectivity between artificial neurons. In a feedforward network, the TE can be used to quantify the relationships between neuron output pairs located in different layers. Our focus is on how to include the TE in the learning mechanisms of a Convolutional Neural Network (CNN) architecture. We introduce a novel training mechanism for CNN architectures which integrates the TE feedback connections. Adding the TE feedback parameter accelerates the training process, as fewer epochs are needed. On the flip side, it adds computational overhead to each epoch. According to our experiments on CNN classifiers, to achieve a reasonable computational overhead--accuracy trade-off, it is efficient to consider only the inter-neural information transfer of a random subset of the neuron pairs from the last two fully connected layers. The TE acts as a smoothing factor, generating stability and becoming active only periodically, not after processing each input sample. Therefore, we can consider the TE is in our model a slowly changing meta-parameter.
- [650] arXiv:2404.02944 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Foundation Models for Structural Health MonitoringLuca Benfenati , Daniele Jahier Pagliari , Luca Zanatta , Yhorman Alexander Bedoya Velez , Andrea Acquaviva , Massimo Poncino , Enrico Macii , Luca Benini , Alessio BurrelloComments: 16 pages, 4 tables, 9 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: Structural Health Monitoring (SHM) is a critical task for ensuring the safety and reliability of civil infrastructures, typically realized on bridges and viaducts by means of vibration monitoring. In this paper, we propose for the first time the use of Transformer neural networks, with a Masked Auto-Encoder architecture, as Foundation Models for SHM. We demonstrate the ability of these models to learn generalizable representations from multiple large datasets through self-supervised pre-training, which, coupled with task-specific fine-tuning, allows them to outperform state-of-the-art traditional methods on diverse tasks, including Anomaly Detection (AD) and Traffic Load Estimation (TLE). We then extensively explore model size versus accuracy trade-offs and experiment with Knowledge Distillation (KD) to improve the performance of smaller Transformers, enabling their embedding directly into the SHM edge nodes. We showcase the effectiveness of our foundation models using data from three operational viaducts. For AD, we achieve a near-perfect 99.9% accuracy with a monitoring time span of just 15 windows. In contrast, a state-of-the-art method based on Principal Component Analysis (PCA) obtains its first good result (95.03% accuracy) only considering 120 windows. On two different TLE tasks, our models obtain state-of-the-art performance on multiple evaluation metrics (R$^2$ score, MAE% and MSE%). On the first benchmark, we achieve an R$^2$ score of 0.97 and 0.85 for light and heavy vehicle traffic, respectively, while the best previous approach stops at 0.91 and 0.84. On the second one, we achieve an R$^2$ score of 0.54 versus the 0.10 of the best existing method.
- [651] arXiv:2404.02945 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Optimizing the Deployment of Tiny Transformers on Low-Power MCUsComments: Pre-print manuscript submitted for review to the IEEE Transactions on ComputersSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Abstract: Transformer networks are rapidly becoming SotA in many fields, such as NLP and CV. Similarly to CNN, there is a strong push for deploying Transformer models at the extreme edge, ultimately fitting the tiny power budget and memory footprint of MCUs. However, the early approaches in this direction are mostly ad-hoc, platform, and model-specific. This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs. We propose a complete framework to perform end-to-end deployment of Transformer models onto single and multi-core MCUs. Our framework provides an optimized library of kernels to maximize data reuse and avoid unnecessary data marshaling operations into the crucial attention block. A novel MHSA inference schedule, named Fused-Weight Self-Attention, is introduced, fusing the linear projection weights offline to further reduce the number of operations and parameters. Furthermore, to mitigate the memory peak reached by the computation of the attention map, we present a Depth-First Tiling scheme for MHSA. We evaluate our framework on three different MCU classes exploiting ARM and RISC-V ISA, namely the STM32H7, the STM32L4, and GAP9 (RV32IMC-XpulpV2). We reach an average of 4.79x and 2.0x lower latency compared to SotA libraries CMSIS-NN (ARM) and PULP-NN (RISC-V), respectively. Moreover, we show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while the fused-weight attention can reduce the runtime by 1.53x, and number of parameters by 25%. We report significant improvements across several Tiny Transformers: for instance, when executing a transformer block for the task of radar-based hand-gesture recognition on GAP9, we achieve a latency of 0.14ms and energy consumption of 4.92 micro-joules, 2.32x lower than the SotA PULP-NN library on the same platform.
- [652] arXiv:2404.02947 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: DNN Memory Footprint Reduction via Post-Training Intra-Layer Multi-Precision QuantizationComments: The 25th International Symposium on Quality Electronic Design (ISQED'24)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The imperative to deploy Deep Neural Network (DNN) models on resource-constrained edge devices, spurred by privacy concerns, has become increasingly apparent. To facilitate the transition from cloud to edge computing, this paper introduces a technique that effectively reduces the memory footprint of DNNs, accommodating the limitations of resource-constrained edge devices while preserving model accuracy. Our proposed technique, named Post-Training Intra-Layer Multi-Precision Quantization (PTILMPQ), employs a post-training quantization approach, eliminating the need for extensive training data. By estimating the importance of layers and channels within the network, the proposed method enables precise bit allocation throughout the quantization process. Experimental results demonstrate that PTILMPQ offers a promising solution for deploying DNNs on edge devices with restricted memory resources. For instance, in the case of ResNet50, it achieves an accuracy of 74.57\% with a memory footprint of 9.5 MB, representing a 25.49\% reduction compared to previous similar methods, with only a minor 1.08\% decrease in accuracy.
- [653] arXiv:2404.02948 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: As the parameters of LLMs expand, the computational cost of fine-tuning the entire model becomes prohibitive. To address this challenge, we introduce a PEFT method, Principal Singular values and Singular vectors Adaptation (PiSSA), which optimizes a significantly reduced parameter space while achieving or surpassing the performance of full-parameter fine-tuning. PiSSA is inspired by Intrinsic SAID, which suggests that pre-trained, over-parametrized models inhabit a space of low intrinsic dimension. Consequently, PiSSA represents a matrix W within the model by the product of two trainable matrices A and B, plus a residual matrix $W^{res}$ for error correction. SVD is employed to factorize W, and the principal singular values and vectors of W are utilized to initialize A and B. The residual singular values and vectors initialize the residual matrix $W^{res}$, which keeps frozen during fine-tuning. Notably, PiSSA shares the same architecture with LoRA. However, LoRA approximates Delta W through the product of two matrices, A, initialized with Gaussian noise, and B, initialized with zeros, while PiSSA initializes A and B with principal singular values and vectors of the original matrix W. PiSSA can better approximate the outcomes of full-parameter fine-tuning at the beginning by changing the essential parts while freezing the "noisy" parts. In comparison, LoRA freezes the original matrix and updates the "noise". This distinction enables PiSSA to convergence much faster than LoRA and also achieve better performance in the end. Due to the same architecture, PiSSA inherits many of LoRA's advantages, such as parameter efficiency and compatibility with quantization. Leveraging a fast SVD method, the initialization of PiSSA takes only a few seconds, inducing negligible cost of switching LoRA to PiSSA.
- [654] arXiv:2404.02949 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level InterpretabilityStephen Casper , Jieun Yun , Joonhyuk Baek , Yeseong Jung , Minhwan Kim , Kiwan Kwon , Saerom Park , Hayden Moore , David Shriver , Marissa Connor , Keltin Grimes , Angus Nicolson , Arush Tagade , Jessica Rumbelow , Hieu Minh Nguyen , Dylan Hadfield-MenellComments: Competition for SaTML 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Interpretability techniques are valuable for helping humans understand and oversee AI systems. The SaTML 2024 CNN Interpretability Competition solicited novel methods for studying convolutional neural networks (CNNs) at the ImageNet scale. The objective of the competition was to help human crowd-workers identify trojans in CNNs. This report showcases the methods and results of four featured competition entries. It remains challenging to help humans reliably diagnose trojans via interpretability tools. However, the competition's entries have contributed new techniques and set a new record on the benchmark from Casper et al., 2023.
- [655] arXiv:2404.02954 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Deep Generative Models through the Lens of the Manifold Hypothesis: A Survey and New ConnectionsGabriel Loaiza-Ganem , Brendan Leigh Ross , Rasa Hosseinzadeh , Anthony L. Caterini , Jesse C. CresswellSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: In recent years there has been increased interest in understanding the interplay between deep generative models (DGMs) and the manifold hypothesis. Research in this area focuses on understanding the reasons why commonly-used DGMs succeed or fail at learning distributions supported on unknown low-dimensional manifolds, as well as developing new models explicitly designed to account for manifold-supported data. This manifold lens provides both clarity as to why some DGMs (e.g. diffusion models and some generative adversarial networks) empirically surpass others (e.g. likelihood-based models such as variational autoencoders, normalizing flows, or energy-based models) at sample generation, and guidance for devising more performant DGMs. We carry out the first survey of DGMs viewed through this lens, making two novel contributions along the way. First, we formally establish that numerical instability of high-dimensional likelihoods is unavoidable when modelling low-dimensional data. We then show that DGMs on learned representations of autoencoders can be interpreted as approximately minimizing Wasserstein distance: this result, which applies to latent diffusion models, helps justify their outstanding empirical results. The manifold lens provides a rich perspective from which to understand DGMs, which we aim to make more accessible and widespread.
- [656] arXiv:2404.02990 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ASAP: Interpretable Analysis and Summarization of AI-generated Image Patterns at ScaleComments: 9 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Generative image models have emerged as a promising technology to produce realistic images. Despite potential benefits, concerns grow about its misuse, particularly in generating deceptive images that could raise significant ethical, legal, and societal issues. Consequently, there is growing demand to empower users to effectively discern and comprehend patterns of AI-generated images. To this end, we developed ASAP, an interactive visualization system that automatically extracts distinct patterns of AI-generated images and allows users to interactively explore them via various views. To uncover fake patterns, ASAP introduces a novel image encoder, adapted from CLIP, which transforms images into compact "distilled" representations, enriched with information for differentiating authentic and fake images. These representations generate gradients that propagate back to the attention maps of CLIP's transformer block. This process quantifies the relative importance of each pixel to image authenticity or fakeness, exposing key deceptive patterns. ASAP enables the at scale interactive analysis of these patterns through multiple, coordinated visualizations. This includes a representation overview with innovative cell glyphs to aid in the exploration and qualitative evaluation of fake patterns across a vast array of images, as well as a pattern view that displays authenticity-indicating patterns in images and quantifies their impact. ASAP supports the analysis of cutting-edge generative models with the latest architectures, including GAN-based models like proGAN and diffusion models like the latent diffusion model. We demonstrate ASAP's usefulness through two usage scenarios using multiple fake image detection benchmark datasets, revealing its ability to identify and understand hidden patterns in AI-generated images, especially in detecting fake human faces produced by diffusion-based techniques.
- [657] arXiv:2404.03011 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Transfer learning applications for anomaly detection in wind turbinesComments: 16 pages, 7 figures, preprint submitted to Energy&AISubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Anomaly detection in wind turbines typically involves using normal behaviour models to detect faults early. However, training autoencoder models for each turbine is time-consuming and resource intensive. Thus, transfer learning becomes essential for wind turbines with limited data or applications with limited computational resources. This study examines how cross-turbine transfer learning can be applied to autoencoder-based anomaly detection. Here, autoencoders are combined with constant thresholds for the reconstruction error to determine if input data contains an anomaly. The models are initially trained on one year's worth of data from one or more source wind turbines. They are then fine-tuned using smaller amounts of data from another turbine. Three methods for fine-tuning are investigated: adjusting the entire autoencoder, only the decoder, or only the threshold of the model. The performance of the transfer learning models is compared to baseline models that were trained on one year's worth of data from the target wind turbine. The results of the tests conducted in this study indicate that models trained on data of multiple wind turbines do not improve the anomaly detection capability compared to models trained on data of one source wind turbine. In addition, modifying the model's threshold can lead to comparable or even superior performance compared to the baseline, whereas fine-tuning the decoder or autoencoder further enhances the models' performance.
- [658] arXiv:2404.03021 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Blessing or curse? A survey on the Impact of Generative AI on Fake NewsComments: 16 pages, 2 figures. Submitted to ACM Transactions on Intelligent Systems and Technology (ACM TIST)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Fake news significantly influence our society. They impact consumers, voters, and many other societal groups. While Fake News exist for a centuries, Generative AI brings fake news on a new level. It is now possible to automate the creation of masses of high-quality individually targeted Fake News. On the other end, Generative AI can also help detecting Fake News. Both fields are young but developing fast.
This survey provides a comprehensive examination of the research and practical use of Generative AI for Fake News detection and creation in 2024. Following the Structured Literature Survey approach, the paper synthesizes current results in the following topic clusters 1) enabling technologies, 2) creation of Fake News, 3) case study social media as most relevant distribution channel, 4) detection of Fake News, and 5) deepfakes as upcoming technology.
The article also identifies current challenges and open issues. - [659] arXiv:2404.03023 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Toward Safe Evolution of Artificial Intelligence (AI) based Conversational Agents to Support Adolescent Mental and Sexual Health Knowledge DiscoveryComments: This paper has been peer-reviewed and presented at the "CHI 2024 Workshop on Child-centred AI Design, May 11, 2024, Honolulu, HI, USA."Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Following the recent release of various Artificial Intelligence (AI) based Conversation Agents (CAs), adolescents are increasingly using CAs for interactive knowledge discovery on sensitive topics, including mental and sexual health topics. Exploring such sensitive topics through online search has been an essential part of adolescent development, and CAs can support their knowledge discovery on such topics through human-like dialogues. Yet, unintended risks have been documented with adolescents' interactions with AI-based CAs, such as being exposed to inappropriate content, false information, and/or being given advice that is detrimental to their mental and physical well-being (e.g., to self-harm). In this position paper, we discuss the current landscape and opportunities for CAs to support adolescents' mental and sexual health knowledge discovery. We also discuss some of the challenges related to ensuring the safety of adolescents when interacting with CAs regarding sexual and mental health topics. We call for a discourse on how to set guardrails for the safe evolution of AI-based CAs for adolescents.
- [660] arXiv:2404.03027 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak AttacksSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: With the rapid advancements in Multimodal Large Language Models (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate an important and unexplored question of whether techniques that successfully jailbreak Large Language Models (LLMs) can be equally effective in jailbreaking MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Utilizing a dataset of 2, 000 malicious queries that is also proposed in this paper, we generate 20, 000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8, 000 image-based jailbreak inputs from recent MLLMs jailbreak attacks, our comprehensive dataset includes 28, 000 test cases across a spectrum of adversarial scenarios. Our evaluation of 10 open-source MLLMs reveals a notably high Attack Success Rate (ASR) for attacks transferred from LLMs, highlighting a critical vulnerability in MLLMs that stems from their text-processing capabilities. Our findings underscore the urgent need for future research to address alignment vulnerabilities in MLLMs from both textual and visual inputs.
- [661] arXiv:2404.03037 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Model-based Reinforcement Learning for Parameterized Action SpacesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We propose a novel model-based reinforcement learning algorithm -- Dynamics Learning and predictive control with Parameterized Actions (DLPA) -- for Parameterized Action Markov Decision Processes (PAMDPs). The agent learns a parameterized-action-conditioned dynamics model and plans with a modified Model Predictive Path Integral control. We theoretically quantify the difference between the generated trajectory and the optimal trajectory during planning in terms of the value they achieved through the lens of Lipschitz Continuity. Our empirical results on several standard benchmarks show that our algorithm achieves superior sample efficiency and asymptotic performance than state-of-the-art PAMDP methods.
- [662] arXiv:2404.03044 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: The Artificial Intelligence Ontology: LLM-assisted construction of AI concept hierarchiesMarcin P. Joachimiak , Mark A. Miller , J. Harry Caufield , Ryan Ly , Nomi L. Harris , Andrew Tritt , Christopher J. Mungall , Kristofer E. BouchardSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The Artificial Intelligence Ontology (AIO) is a systematization of artificial intelligence (AI) concepts, methodologies, and their interrelations. Developed via manual curation, with the additional assistance of large language models (LLMs), AIO aims to address the rapidly evolving landscape of AI by providing a comprehensive framework that encompasses both technical and ethical aspects of AI technologies. The primary audience for AIO includes AI researchers, developers, and educators seeking standardized terminology and concepts within the AI domain. The ontology is structured around six top-level branches: Networks, Layers, Functions, LLMs, Preprocessing, and Bias, each designed to support the modular composition of AI methods and facilitate a deeper understanding of deep learning architectures and ethical considerations in AI.
AIO's development utilized the Ontology Development Kit (ODK) for its creation and maintenance, with its content being dynamically updated through AI-driven curation support. This approach not only ensures the ontology's relevance amidst the fast-paced advancements in AI but also significantly enhances its utility for researchers, developers, and educators by simplifying the integration of new AI concepts and methodologies.
The ontology's utility is demonstrated through the annotation of AI methods data in a catalog of AI research publications and the integration into the BioPortal ontology resource, highlighting its potential for cross-disciplinary research. The AIO ontology is open source and is available on GitHub ( this https URL ) and BioPortal ( this https URL ). - [663] arXiv:2404.03058 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Automatic Extraction of Linguistic Description from Fuzzy Rule BaseSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Neuro-fuzzy systems are a technique of explainable artificial intelligence (XAI). They elaborate knowledge models as a set of fuzzy rules. Fuzzy sets are crucial components of fuzzy rules. They are used to model linguistic terms. In this paper, we present an automatic extraction of fuzzy rules in the natural English language. Full implementation is available free from a public repository.
- [664] arXiv:2404.03080 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Construction of Functional Materials Knowledge Graph in Multidisciplinary Materials Science via Large Language ModelComments: 11 pages, 5 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The convergence of materials science and artificial intelligence has unlocked new opportunities for gathering, analyzing, and generating novel materials sourced from extensive scientific literature. Despite the potential benefits, persistent challenges such as manual annotation, precise extraction, and traceability issues remain. Large language models have emerged as promising solutions to address these obstacles. This paper introduces Functional Materials Knowledge Graph (FMKG), a multidisciplinary materials science knowledge graph. Through the utilization of advanced natural language processing techniques, extracting millions of entities to form triples from a corpus comprising all high-quality research papers published in the last decade. It organizes unstructured information into nine distinct labels, covering Name, Formula, Acronym, Structure/Phase, Properties, Descriptor, Synthesis, Characterization Method, Application, and Domain, seamlessly integrating papers' Digital Object Identifiers. As the latest structured database for functional materials, FMKG acts as a powerful catalyst for expediting the development of functional materials and a fundation for building a more comprehensive material knowledge graph using full paper text. Furthermore, our research lays the groundwork for practical text-mining-based knowledge management systems, not only in intricate materials systems but also applicable to other specialized domains.
- [665] arXiv:2404.03084 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Rethinking Teacher-Student Curriculum Learning through the Cooperative Mechanics of ExperienceSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Abstract: Teacher-Student Curriculum Learning (TSCL) is a curriculum learning framework that draws inspiration from human cultural transmission and learning. It involves a teacher algorithm shaping the learning process of a learner algorithm by exposing it to controlled experiences. Despite its success, understanding the conditions under which TSCL is effective remains challenging. In this paper, we propose a data-centric perspective to analyze the underlying mechanics of the teacher-student interactions in TSCL. We leverage cooperative game theory to describe how the composition of the set of experiences presented by the teacher to the learner, as well as their order, influences the performance of the curriculum that is found by TSCL approaches. To do so, we demonstrate that for every TSCL problem, there exists an equivalent cooperative game, and several key components of the TSCL framework can be reinterpreted using game-theoretic principles. Through experiments covering supervised learning, reinforcement learning, and classical games, we estimate the cooperative values of experiences and use value-proportional curriculum mechanisms to construct curricula, even in cases where TSCL struggles. The framework and experimental setup we present in this work represent a novel foundation for a deeper exploration of TSCL, shedding light on its underlying mechanisms and providing insights into its broader applicability in machine learning.
- [666] arXiv:2404.03085 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Talaria: Interactively Optimizing Machine Learning Models for Efficient InferenceFred Hohman , Chaoqun Wang , Jinmook Lee , Jochen Görtler , Dominik Moritz , Jeffrey P Bigham , Zhile Ren , Cecile Foret , Qi Shan , Xiaoyi ZhangComments: Proceedings of the 2024 ACM CHI Conference on Human Factors in Computing SystemsSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: On-device machine learning (ML) moves computation from the cloud to personal devices, protecting user privacy and enabling intelligent user experiences. However, fitting models on devices with limited resources presents a major technical challenge: practitioners need to optimize models and balance hardware metrics such as model size, latency, and power. To help practitioners create efficient ML models, we designed and developed Talaria: a model visualization and optimization system. Talaria enables practitioners to compile models to hardware, interactively visualize model statistics, and simulate optimizations to test the impact on inference metrics. Since its internal deployment two years ago, we have evaluated Talaria using three methodologies: (1) a log analysis highlighting its growth of 800+ practitioners submitting 3,600+ models; (2) a usability survey with 26 users assessing the utility of 20 Talaria features; and (3) a qualitative interview with the 7 most active users about their experience using Talaria.
- [667] arXiv:2404.03088 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Robust Federated Learning for Wireless Networks: A Demonstration with Channel EstimationComments: Submitted to IEEE GLOBECOM 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Abstract: Federated learning (FL) offers a privacy-preserving collaborative approach for training models in wireless networks, with channel estimation emerging as a promising application. Despite extensive studies on FL-empowered channel estimation, the security concerns associated with FL require meticulous attention. In a scenario where small base stations (SBSs) serve as local models trained on cached data, and a macro base station (MBS) functions as the global model setting, an attacker can exploit the vulnerability of FL, launching attacks with various adversarial attacks or deployment tactics. In this paper, we analyze such vulnerabilities, corresponding solutions were brought forth, and validated through simulation.
- [668] arXiv:2404.03098 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Exploring the Trade-off Between Model Performance and Explanation Plausibility of Text Classifiers Using Human RationalesComments: 27 pages, 22 figures, 8 tables; to appear in NAACL Findings 2024; code and data available at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Saliency post-hoc explainability methods are important tools for understanding increasingly complex NLP models. While these methods can reflect the model's reasoning, they may not align with human intuition, making the explanations not plausible. In this work, we present a methodology for incorporating rationales, which are text annotations explaining human decisions, into text classification models. This incorporation enhances the plausibility of post-hoc explanations while preserving their faithfulness. Our approach is agnostic to model architectures and explainability methods. We introduce the rationales during model training by augmenting the standard cross-entropy loss with a novel loss function inspired by contrastive learning. By leveraging a multi-objective optimization algorithm, we explore the trade-off between the two loss functions and generate a Pareto-optimal frontier of models that balance performance and plausibility. Through extensive experiments involving diverse models, datasets, and explainability methods, we demonstrate that our approach significantly enhances the quality of model explanations without causing substantial (sometimes negligible) degradation in the original model's performance.
- [669] arXiv:2404.03099 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Composite Bayesian Optimization In Function Spaces Using NEON -- Neural Epistemic Operator NetworksSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Theory (cs.IT); Machine Learning (stat.ML)
Abstract: Operator learning is a rising field of scientific computing where inputs or outputs of a machine learning model are functions defined in infinite-dimensional spaces. In this paper, we introduce NEON (Neural Epistemic Operator Networks), an architecture for generating predictions with uncertainty using a single operator network backbone, which presents orders of magnitude less trainable parameters than deep ensembles of comparable performance. We showcase the utility of this method for sequential decision-making by examining the problem of composite Bayesian Optimization (BO), where we aim to optimize a function $f=g\circ h$, where $h:X\to C(\mathcal{Y},\mathbb{R}^{d_s})$ is an unknown map which outputs elements of a function space, and $g: C(\mathcal{Y},\mathbb{R}^{d_s})\to \mathbb{R}$ is a known and cheap-to-compute functional. By comparing our approach to other state-of-the-art methods on toy and real world scenarios, we demonstrate that NEON achieves state-of-the-art performance while requiring orders of magnitude less trainable parameters.
- [670] arXiv:2404.03114 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Testing the Effect of Code Documentation on Large Language Model Code UnderstandingComments: 7 pages, 5 figures, 2 tables. Accepted as a Findings paper in the "Generation" track to NAACL 2024. MITRE Public Release Case Number 23-4132Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large Language Models (LLMs) have demonstrated impressive abilities in recent years with regards to code generation and understanding. However, little work has investigated how documentation and other code properties affect an LLM's ability to understand and generate code or documentation. We present an empirical analysis of how underlying properties of code or documentation can affect an LLM's capabilities. We show that providing an LLM with "incorrect" documentation can greatly hinder code understanding, while incomplete or missing documentation does not seem to significantly affect an LLM's ability to understand code.
- [671] arXiv:2404.03133 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: A Framework for Guided Motion PlanningAmnon Attali , Stav Ashur , Isaac Burton Love , Courtney McBeth , James Motes , Marco Morales , Nancy M. AmatoSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Randomized sampling based algorithms are widely used in robot motion planning due to the problem's intractability, and are experimentally effective on a wide range of problem instances. Most variants bias their sampling using various heuristics related to the known underlying structure of the search space. In this work, we formalize the intuitive notion of guided search by defining the concept of a guiding space. This new language encapsulates many seemingly distinct prior methods under the same framework, and allows us to reason about guidance, a previously obscured core contribution of different algorithms. We suggest an information theoretic method to evaluate guidance, which experimentally matches intuition when tested on known algorithms in a variety of environments. The language and evaluation of guidance suggests improvements to existing methods, and allows for simple hybrid algorithms that combine guidance from multiple sources.
- [672] arXiv:2404.03147 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: EigenpruningComments: Extended abstract accepted to LatinX at NAACL 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We introduce eigenpruning, a method that removes singular values from weight matrices in an LLM to improve its performance in a particular task. This method is inspired by interpretability methods designed to automatically find subnetworks of a model which solve a specific task. In our tests, the pruned model outperforms the original model by a large margin, while only requiring minimal computation to prune the weight matrices. In the case of a small synthetic task in integer multiplication, the Phi-2 model can improve its accuracy in the test set from 13.75% to 97.50%. Interestingly, these results seem to indicate the existence of a computation path that can solve the task very effectively, but it was not being used by the original model. Finally, we publicly release our implementation.
- [673] arXiv:2404.03150 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: NLP at UC Santa Cruz at SemEval-2024 Task 5: Legal Answer Validation using Few-Shot Multi-Choice QASubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents our submission to the SemEval 2024 Task 5: The Legal Argument Reasoning Task in Civil Procedure. We present two approaches to solving the task of legal answer validation, given an introduction to the case, a question and an answer candidate. Firstly, we fine-tuned pre-trained BERT-based models and found that models trained on domain knowledge perform better. Secondly, we performed few-shot prompting on GPT models and found that reformulating the answer validation task to be a multiple-choice QA task remarkably improves the performance of the model. Our best submission is a BERT-based model that achieved the 7th place out of 20.
- [674] arXiv:2404.03163 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Uncertainty in Language Models: Assessment through Rank-CalibrationXinmeng Huang , Shuo Li , Mengxin Yu , Matteo Sesia , Hamed Hassani , Insup Lee , Osbert Bastani , Edgar DobribanSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract: Language Models (LMs) have shown promising performance in natural language generation. However, as LMs often generate incorrect or hallucinated responses, it is crucial to correctly quantify their uncertainty in responding to given inputs. In addition to verbalized confidence elicited via prompting, many uncertainty measures ($e.g.$, semantic entropy and affinity-graph-based measures) have been proposed. However, these measures can differ greatly, and it is unclear how to compare them, partly because they take values over different ranges ($e.g.$, $[0,\infty)$ or $[0,1]$). In this work, we address this issue by developing a novel and practical framework, termed $Rank$-$Calibration$, to assess uncertainty and confidence measures for LMs. Our key tenet is that higher uncertainty (or lower confidence) should imply lower generation quality, on average. Rank-calibration quantifies deviations from this ideal relationship in a principled manner, without requiring ad hoc binary thresholding of the correctness score ($e.g.$, ROUGE or METEOR). The broad applicability and the granular interpretability of our methods are demonstrated empirically.
- [675] arXiv:2404.03164 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Does Knowledge Graph Really Matter for Recommender Systems?Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recommender systems (RSs) are designed to provide personalized recommendations to users. Recently, knowledge graphs (KGs) have been widely introduced in RSs to improve recommendation accuracy. In this study, however, we demonstrate that RSs do not necessarily perform worse even if the KG is downgraded to the user-item interaction graph only (or removed). We propose an evaluation framework KG4RecEval to systematically evaluate how much a KG contributes to the recommendation accuracy of a KG-based RS, using our defined metric KGER (KG utilization efficiency in recommendation). We consider the scenarios where knowledge in a KG gets completely removed, randomly distorted and decreased, and also where recommendations are for cold-start users. Our extensive experiments on four commonly used datasets and a number of state-of-the-art KG-based RSs reveal that: to remove, randomly distort or decrease knowledge does not necessarily decrease recommendation accuracy, even for cold-start users. These findings inspire us to rethink how to better utilize knowledge from existing KGs, whereby we discuss and provide insights into what characteristics of datasets and KG-based RSs may help improve KG utilization efficiency.
- [676] arXiv:2404.03184 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: The Death of Feature Engineering? BERT with Linguistic Features on SQuAD 2.0Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Machine reading comprehension is an essential natural language processing task, which takes into a pair of context and query and predicts the corresponding answer to query. In this project, we developed an end-to-end question answering model incorporating BERT and additional linguistic features. We conclude that the BERT base model will be improved by incorporating the features. The EM score and F1 score are improved 2.17 and 2.14 compared with BERT(base). Our best single model reaches EM score 76.55 and F1 score 79.97 in the hidden test set. Our error analysis also shows that the linguistic architecture can help model understand the context better in that it can locate answers that BERT only model predicted "No Answer" wrongly.
- [677] arXiv:2404.03189 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language ModelsComments: 19 pages, 2 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In order to oversee advanced AI systems, it is important to understand their underlying decision-making process. When prompted, large language models (LLMs) can provide natural language explanations or reasoning traces that sound plausible and receive high ratings from human annotators. However, it is unclear to what extent these explanations are faithful, i.e., truly capture the factors responsible for the model's predictions. In this work, we introduce Correlational Explanatory Faithfulness (CEF), a metric that can be used in faithfulness tests based on input interventions. Previous metrics used in such tests take into account only binary changes in the predictions. Our metric accounts for the total shift in the model's predicted label distribution, more accurately reflecting the explanations' faithfulness. We then introduce the Correlational Counterfactual Test (CCT) by instantiating CEF on the Counterfactual Test (CT) from Atanasova et al. (2023). We evaluate the faithfulness of free-text explanations generated by few-shot-prompted LLMs from the Llama2 family on three NLP tasks. We find that our metric measures aspects of faithfulness which the CT misses.
- [678] arXiv:2404.03190 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Adaptive Discrete Disparity Volume for Self-supervised Monocular Depth EstimationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: In self-supervised monocular depth estimation tasks, discrete disparity prediction has been proven to attain higher quality depth maps than common continuous methods. However, current discretization strategies often divide depth ranges of scenes into bins in a handcrafted and rigid manner, limiting model performance. In this paper, we propose a learnable module, Adaptive Discrete Disparity Volume (ADDV), which is capable of dynamically sensing depth distributions in different RGB images and generating adaptive bins for them. Without any extra supervision, this module can be integrated into existing CNN architectures, allowing networks to produce representative values for bins and a probability volume over them. Furthermore, we introduce novel training strategies - uniformizing and sharpening - through a loss term and temperature parameter, respectively, to provide regularizations under self-supervised conditions, preventing model degradation or collapse. Empirical results demonstrate that ADDV effectively processes global information, generating appropriate bins for various scenes and producing higher quality depth maps compared to handcrafted methods.
- [679] arXiv:2404.03204 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech SynthesisDetai Xin , Xu Tan , Kai Shen , Zeqian Ju , Dongchao Yang , Yuancheng Wang , Shinnosuke Takamichi , Hiroshi Saruwatari , Shujie Liu , Jinyu Li , Sheng ZhaoSubjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Abstract: We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To accomplish this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E utilizes the predicted duration prompt to guide the computing of self-attention weights in Transformer to enforce the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Results of comprehensive objective and subjective evaluations demonstrate that, compared to a powerful baseline method VALL-E, RALL-E significantly improves the WER of zero-shot TTS from $6.3\%$ (without reranking) and $2.1\%$ (with reranking) to $2.8\%$ and $1.0\%$, respectively. Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from $68\%$ to $4\%$.
- [680] arXiv:2404.03239 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Exploring Emotions in Multi-componential Space using Interactive VR GamesComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Emotion understanding is a complex process that involves multiple components. The ability to recognise emotions not only leads to new context awareness methods but also enhances system interaction's effectiveness by perceiving and expressing emotions. Despite the attention to discrete and dimensional models, neuroscientific evidence supports those emotions as being complex and multi-faceted. One framework that resonated well with such findings is the Component Process Model (CPM), a theory that considers the complexity of emotions with five interconnected components: appraisal, expression, motivation, physiology and feeling. However, the relationship between CPM and discrete emotions has not yet been fully explored. Therefore, to better understand emotions underlying processes, we operationalised a data-driven approach using interactive Virtual Reality (VR) games and collected multimodal measures (self-reports, physiological and facial signals) from 39 participants. We used Machine Learning (ML) methods to identify the unique contributions of each component to emotion differentiation. Our results showed the role of different components in emotion differentiation, with the model including all components demonstrating the most significant contribution. Moreover, we found that at least five dimensions are needed to represent the variation of emotions in our dataset. These findings also have implications for using VR environments in emotion research and highlight the role of physiological signals in emotion recognition within such environments.
- [681] arXiv:2404.03253 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentationYin Li , Qi Chen , Kai Wang , Meige Li , Liping Si , Yingwei Guo , Yu Xiong , Qixing Wang , Yang Qin , Ling Xu , Patrick van der Smagt , Jun Tang , Nutan ChenSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Multi-modality magnetic resonance imaging data with various sequences facilitate the early diagnosis, tumor segmentation, and disease staging in the management of nasopharyngeal carcinoma (NPC). The lack of publicly available, comprehensive datasets limits advancements in diagnosis, treatment planning, and the development of machine learning algorithms for NPC. Addressing this critical need, we introduce the first comprehensive NPC MRI dataset, encompassing MR axial imaging of 277 primary NPC patients. This dataset includes T1-weighted, T2-weighted, and contrast-enhanced T1-weighted sequences, totaling 831 scans. In addition to the corresponding clinical data, manually annotated and labeled segmentations by experienced radiologists offer high-quality data resources from untreated primary NPC.
- [682] arXiv:2404.03259 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Enhancing the Performance of Aspect-Based Sentiment Analysis SystemsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Aspect-based sentiment analysis aims to predict sentiment polarity with fine granularity. While Graph Convolutional Networks (GCNs) are widely utilized for sentimental feature extraction, their naive application for syntactic feature extraction can compromise information preservation. This study introduces an innovative edge-enhanced GCN, named SentiSys, to navigate the syntactic graph while preserving intact feature information, leading to enhanced performance. Specifically,we first integrate a bidirectional long short-term memory (Bi-LSTM) network and a self-attention-based transformer. This combination facilitates effective text encoding, preventing the loss of information and predicting long dependency text. A bidirectional GCN (Bi-GCN) with message passing is then employed to encode relationships between entities. Additionally, unnecessary information is filtered out using an aspect-specific masking technique. To validate the effectiveness of our proposed model, we conduct extensive evaluation experiments and ablation studies on four benchmark datasets. The results consistently demonstrate improved performance in aspect-based sentiment analysis when employing SentiSys. This approach successfully addresses the challenges associated with syntactic feature extraction, highlighting its potential for advancing sentiment analysis methodologies.
- [683] arXiv:2404.03263 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: On the Surprising Efficacy of Distillation as an Alternative to Pre-Training Small ModelsComments: ICLR 2024. 5th Workshop on Practical ML for Low Resource Settings (PML4LRS). Code can be found at this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we propose that small models may not need to absorb the cost of pre-training to reap its benefits. Instead, they can capitalize on the astonishing results achieved by modern, enormous models to a surprising degree. We observe that, when distilled on a task from a pre-trained teacher model, a small model can achieve or surpass the performance it would achieve if it was pre-trained then finetuned on that task. To allow this phenomenon to be easily leveraged, we establish a connection reducing knowledge distillation to modern contrastive learning, opening two doors: (1) vastly different model architecture pairings can work for the distillation, and (2) most contrastive learning algorithms rooted in the theory of Noise Contrastive Estimation can be easily applied and used. We demonstrate this paradigm using pre-trained teacher models from open-source model hubs, Transformer and convolution based model combinations, and a novel distillation algorithm that massages the Alignment/Uniformity perspective of contrastive learning by Wang & Isola (2020) into a distillation objective. We choose this flavor of contrastive learning due to its low computational cost, an overarching theme of this work. We also observe that this phenomenon tends not to occur if the task is data-limited. However, this can be alleviated by leveraging yet another scale-inspired development: large, pre-trained generative models for dataset augmentation. Again, we use an open-source model, and our rudimentary prompts are sufficient to boost the small model`s performance. Thus, we highlight a training method for small models that is up to 94% faster than the standard pre-training paradigm without sacrificing performance. For practitioners discouraged from fully utilizing modern foundation datasets for their small models due to the prohibitive scale, we believe our work keeps that door open.
- [684] arXiv:2404.03264 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future DirectionsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Foundation model, which is pre-trained on broad data and is able to adapt to a wide range of tasks, is advancing healthcare. It promotes the development of healthcare artificial intelligence (AI) models, breaking the contradiction between limited AI models and diverse healthcare practices. Much more widespread healthcare scenarios will benefit from the development of a healthcare foundation model (HFM), improving their advanced intelligent healthcare services. Despite the impending widespread deployment of HFMs, there is currently a lack of clear understanding about how they work in the healthcare field, their current challenges, and where they are headed in the future. To answer these questions, a comprehensive and deep survey of the challenges, opportunities, and future directions of HFMs is presented in this survey. It first conducted a comprehensive overview of the HFM including the methods, data, and applications for a quick grasp of the current progress. Then, it made an in-depth exploration of the challenges present in data, algorithms, and computing infrastructures for constructing and widespread application of foundation models in healthcare. This survey also identifies emerging and promising directions in this field for future development. We believe that this survey will enhance the community's comprehension of the current progress of HFM and serve as a valuable source of guidance for future development in this field. The latest HFM papers and related resources are maintained on our website: this https URL .
- [685] arXiv:2404.03275 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: DELTA: Decomposed Efficient Long-Term Robot Task Planning using Large Language ModelsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in Large Language Models (LLMs) have sparked a revolution across various research fields. In particular, the integration of common-sense knowledge from LLMs into robot task and motion planning has been proven to be a game-changer, elevating performance in terms of explainability and downstream task efficiency to unprecedented heights. However, managing the vast knowledge encapsulated within these large models has posed challenges, often resulting in infeasible plans generated by LLM-based planning systems due to hallucinations or missing domain information. To overcome these challenges and obtain even greater planning feasibility and computational efficiency, we propose a novel LLM-driven task planning approach called DELTA. For achieving better grounding from environmental topology into actionable knowledge, DELTA leverages the power of scene graphs as environment representations within LLMs, enabling the fast generation of precise planning problem descriptions. For obtaining higher planning performance, we use LLMs to decompose the long-term task goals into an autoregressive sequence of sub-goals for an automated task planner to solve. Our contribution enables a more efficient and fully automatic task planning pipeline, achieving higher planning success rates and significantly shorter planning times compared to the state of the art.
- [686] arXiv:2404.03276 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: A Deep Reinforcement Learning Approach for Security-Aware Service Acquisition in IoTSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: The novel Internet of Things (IoT) paradigm is composed of a growing number of heterogeneous smart objects and services that are transforming architectures and applications, increasing systems' complexity, and the need for reliability and autonomy. In this context, both smart objects and services are often provided by third parties which do not give full transparency regarding the security and privacy of the features offered. Although machine-based Service Level Agreements (SLA) have been recently leveraged to establish and share policies in Cloud-based scenarios, and also in the IoT context, the issue of making end users aware of the overall system security levels and the fulfillment of their privacy requirements through the provision of the requested service remains a challenging task. To tackle this problem, we propose a complete framework that defines suitable levels of privacy and security requirements in the acquisition of services in IoT, according to the user needs. Through the use of a Reinforcement Learning based solution, a user agent, inside the environment, is trained to choose the best smart objects granting access to the target services. Moreover, the solution is designed to guarantee deadline requirements and user security and privacy needs. Finally, to evaluate the correctness and the performance of the proposed approach we illustrate an extensive experimental analysis.
- [687] arXiv:2404.03304 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Concept -- An Evaluation Protocol on Conversational Recommender Systems with System-centric and User-centric FactorsComments: 33 pages, 18 tables, and 10 figures. Our code is available at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The conversational recommendation system (CRS) has been criticized regarding its user experience in real-world scenarios, despite recent significant progress achieved in academia. Existing evaluation protocols for CRS may prioritize system-centric factors such as effectiveness and fluency in conversation while neglecting user-centric aspects. Thus, we propose a new and inclusive evaluation protocol, Concept, which integrates both system- and user-centric factors. We conceptualise three key characteristics in representing such factors and further divide them into six primary abilities. To implement Concept, we adopt a LLM-based user simulator and evaluator with scoring rubrics that are tailored for each primary ability. Our protocol, Concept, serves a dual purpose. First, it provides an overview of the pros and cons in current CRS models. Second, it pinpoints the problem of low usability in the "omnipotent" ChatGPT and offers a comprehensive reference guide for evaluating CRS, thereby setting the foundation for CRS improvement.
- [688] arXiv:2404.03323 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Sparse Concept Bottleneck Models: Gumbel Tricks in Contrastive LearningComments: 23 pages, 1 algorithm, 36 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We propose a novel architecture and method of explainable classification with Concept Bottleneck Models (CBMs). While SOTA approaches to Image Classification task work as a black box, there is a growing demand for models that would provide interpreted results. Such a models often learn to predict the distribution over class labels using additional description of this target instances, called concepts. However, existing Bottleneck methods have a number of limitations: their accuracy is lower than that of a standard model and CBMs require an additional set of concepts to leverage. We provide a framework for creating Concept Bottleneck Model from pre-trained multi-modal encoder and new CLIP-like architectures. By introducing a new type of layers known as Concept Bottleneck Layers, we outline three methods for training them: with $\ell_1$-loss, contrastive loss and loss function based on Gumbel-Softmax distribution (Sparse-CBM), while final FC layer is still trained with Cross-Entropy. We show a significant increase in accuracy using sparse hidden layers in CLIP-based bottleneck models. Which means that sparse representation of concepts activation vector is meaningful in Concept Bottleneck Models. Moreover, with our Concept Matrix Search algorithm we can improve CLIP predictions on complex datasets without any additional training or fine-tuning. The code is available at: this https URL .
- [689] arXiv:2404.03325 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Embodied Neuromorphic Artificial Intelligence for Robotics: Perspectives, Challenges, and Research Development StackRachmad Vidya Wicaksana Putra , Alberto Marchisio , Fakhreddine Zayer , Jorge Dias , Muhammad ShafiqueComments: 8 pages, 9 figures, 1 tableSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Abstract: Robotic technologies have been an indispensable part for improving human productivity since they have been helping humans in completing diverse, complex, and intensive tasks in a fast yet accurate and efficient way. Therefore, robotic technologies have been deployed in a wide range of applications, ranging from personal to industrial use-cases. However, current robotic technologies and their computing paradigm still lack embodied intelligence to efficiently interact with operational environments, respond with correct/expected actions, and adapt to changes in the environments. Toward this, recent advances in neuromorphic computing with Spiking Neural Networks (SNN) have demonstrated the potential to enable the embodied intelligence for robotics through bio-plausible computing paradigm that mimics how the biological brain works, known as "neuromorphic artificial intelligence (AI)". However, the field of neuromorphic AI-based robotics is still at an early stage, therefore its development and deployment for solving real-world problems expose new challenges in different design aspects, such as accuracy, adaptability, efficiency, reliability, and security. To address these challenges, this paper will discuss how we can enable embodied neuromorphic AI for robotic systems through our perspectives: (P1) Embodied intelligence based on effective learning rule, training mechanism, and adaptability; (P2) Cross-layer optimizations for energy-efficient neuromorphic computing; (P3) Representative and fair benchmarks; (P4) Low-cost reliability and safety enhancements; (P5) Security and privacy for neuromorphic computing; and (P6) A synergistic development for energy-efficient and robust neuromorphic-based robotics. Furthermore, this paper identifies research challenges and opportunities, as well as elaborates our vision for future research development toward embodied neuromorphic AI for robotics.
- [690] arXiv:2404.03348 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Knowledge Distillation-Based Model Extraction Attack using Private Counterfactual ExplanationsComments: 15 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Abstract: In recent years, there has been a notable increase in the deployment of machine learning (ML) models as services (MLaaS) across diverse production software applications. In parallel, explainable AI (XAI) continues to evolve, addressing the necessity for transparency and trustworthiness in ML models. XAI techniques aim to enhance the transparency of ML models by providing insights, in terms of the model's explanations, into their decision-making process. Simultaneously, some MLaaS platforms now offer explanations alongside the ML prediction outputs. This setup has elevated concerns regarding vulnerabilities in MLaaS, particularly in relation to privacy leakage attacks such as model extraction attacks (MEA). This is due to the fact that explanations can unveil insights about the inner workings of the model which could be exploited by malicious users. In this work, we focus on investigating how model explanations, particularly Generative adversarial networks (GANs)-based counterfactual explanations (CFs), can be exploited for performing MEA within the MLaaS platform. We also delve into assessing the effectiveness of incorporating differential privacy (DP) as a mitigation strategy. To this end, we first propose a novel MEA methodology based on Knowledge Distillation (KD) to enhance the efficiency of extracting a substitute model of a target model exploiting CFs. Then, we advise an approach for training CF generators incorporating DP to generate private CFs. We conduct thorough experimental evaluations on real-world datasets and demonstrate that our proposed KD-based MEA can yield a high-fidelity substitute model with reduced queries with respect to baseline approaches. Furthermore, our findings reveal that the inclusion of a privacy layer impacts the performance of the explainer, the quality of CFs, and results in a reduction in the MEA performance.
- [691] arXiv:2404.03354 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: A Comprehensive Survey on Self-Supervised Learning for RecommendationSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Recommender systems play a crucial role in tackling the challenge of information overload by delivering personalized recommendations based on individual user preferences. Deep learning techniques, such as RNNs, GNNs, and Transformer architectures, have significantly propelled the advancement of recommender systems by enhancing their comprehension of user behaviors and preferences. However, supervised learning methods encounter challenges in real-life scenarios due to data sparsity, resulting in limitations in their ability to learn representations effectively. To address this, self-supervised learning (SSL) techniques have emerged as a solution, leveraging inherent data structures to generate supervision signals without relying solely on labeled data. By leveraging unlabeled data and extracting meaningful representations, recommender systems utilizing SSL can make accurate predictions and recommendations even when confronted with data sparsity. In this paper, we provide a comprehensive review of self-supervised learning frameworks designed for recommender systems, encompassing a thorough analysis of over 170 papers. We conduct an exploration of nine distinct scenarios, enabling a comprehensive understanding of SSL-enhanced recommenders in different contexts. For each domain, we elaborate on different self-supervised learning paradigms, namely contrastive learning, generative learning, and adversarial learning, so as to present technical details of how SSL enhances recommender systems in various contexts. We consistently maintain the related open-source materials at this https URL .
- [692] arXiv:2404.03359 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: REACT: Revealing Evolutionary Action Consequence Trajectories for Interpretable Reinforcement LearningPhilipp Altmann , Céline Davignon , Maximilian Zorn , Fabian Ritz , Claudia Linnhoff-Popien , Thomas GaborComments: 12 pages, 12 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: To enhance the interpretability of Reinforcement Learning (RL), we propose Revealing Evolutionary Action Consequence Trajectories (REACT). In contrast to the prevalent practice of validating RL models based on their optimal behavior learned during training, we posit that considering a range of edge-case trajectories provides a more comprehensive understanding of their inherent behavior. To induce such scenarios, we introduce a disturbance to the initial state, optimizing it through an evolutionary algorithm to generate a diverse population of demonstrations. To evaluate the fitness of trajectories, REACT incorporates a joint fitness function that encourages both local and global diversity in the encountered states and chosen actions. Through assessments with policies trained for varying durations in discrete and continuous environments, we demonstrate the descriptive power of REACT. Our results highlight its effectiveness in revealing nuanced aspects of RL models' behavior beyond optimal performance, thereby contributing to improved interpretability.
- [693] arXiv:2404.03382 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: DIDA: Denoised Imitation Learning based on Domain AdaptationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Imitating skills from low-quality datasets, such as sub-optimal demonstrations and observations with distractors, is common in real-world applications. In this work, we focus on the problem of Learning from Noisy Demonstrations (LND), where the imitator is required to learn from data with noise that often occurs during the processes of data collection or transmission. Previous IL methods improve the robustness of learned policies by injecting an adversarially learned Gaussian noise into pure expert data or utilizing additional ranking information, but they may fail in the LND setting. To alleviate the above problems, we propose Denoised Imitation learning based on Domain Adaptation (DIDA), which designs two discriminators to distinguish the noise level and expertise level of data, facilitating a feature encoder to learn task-related but domain-agnostic representations. Experiment results on MuJoCo demonstrate that DIDA can successfully handle challenging imitation tasks from demonstrations with various types of noise, outperforming most baseline methods.
- [694] arXiv:2404.03386 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: SENSOR: Imitate Third-Person Expert's Behaviors via Active SensoringSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In many real-world visual Imitation Learning (IL) scenarios, there is a misalignment between the agent's and the expert's perspectives, which might lead to the failure of imitation. Previous methods have generally solved this problem by domain alignment, which incurs extra computation and storage costs, and these methods fail to handle the \textit{hard cases} where the viewpoint gap is too large. To alleviate the above problems, we introduce active sensoring in the visual IL setting and propose a model-based SENSory imitatOR (SENSOR) to automatically change the agent's perspective to match the expert's. SENSOR jointly learns a world model to capture the dynamics of latent states, a sensor policy to control the camera, and a motor policy to control the agent. Experiments on visual locomotion tasks show that SENSOR can efficiently simulate the expert's perspective and strategy, and outperforms most baseline methods.
- [695] arXiv:2404.03414 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-ThoughtComments: This paper is accepted to LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We introduce a novel framework, LM-Guided CoT, that leverages a lightweight (i.e., <1B) language model (LM) for guiding a black-box large (i.e., >10B) LM in reasoning tasks. Specifically, the lightweight LM first generates a rationale for each input instance. The Frozen large LM is then prompted to predict a task output based on the rationale generated by the lightweight LM. Our approach is resource-efficient in the sense that it only requires training the lightweight LM. We optimize the model through 1) knowledge distillation and 2) reinforcement learning from rationale-oriented and task-oriented reward signals. We assess our method with multi-hop extractive question answering (QA) benchmarks, HotpotQA, and 2WikiMultiHopQA. Experimental results show that our approach outperforms all baselines regarding answer prediction accuracy. We also find that reinforcement learning helps the model to produce higher-quality rationales with improved QA performance.
- [696] arXiv:2404.03418 (cross-list from cs.LO) [ pdf , ps , html , other ]
-
Title: Permissible Knowledge PoolingSubjects: Logic in Computer Science (cs.LO) ; Artificial Intelligence (cs.AI)
Abstract: Information pooling has been extensively formalised across various logical frameworks in distributed systems, characterized by diverse information-sharing patterns. These approaches generally adopt an intersection perspective, aggregating all possible information, regardless of whether it is known or unknown to the agents. In contrast, this work adopts a unique stance, emphasising that sharing knowledge means distributing what is known, rather than what remains uncertain. This paper introduces new modal logics for knowledge pooling and sharing, ranging from a novel language of knowledge pooling to a dynamic mechanism for knowledge sharing. It also outlines their axiomatizations and discusses a potential framework for permissible knowledge pooling.
- [697] arXiv:2404.03419 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Integrating Hyperparameter Search into Model-Free AutoML with Context-Free GrammarsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Automated Machine Learning (AutoML) has become increasingly popular in recent years due to its ability to reduce the amount of time and expertise required to design and develop machine learning systems. This is very important for the practice of machine learning, as it allows building strong baselines quickly, improving the efficiency of the data scientists, and reducing the time to production. However, despite the advantages of AutoML, it faces several challenges, such as defining the solutions space and exploring it efficiently. Recently, some approaches have been shown to be able to do it using tree-based search algorithms and context-free grammars. In particular, GramML presents a model-free reinforcement learning approach that leverages pipeline configuration grammars and operates using Monte Carlo tree search. However, one of the limitations of GramML is that it uses default hyperparameters, limiting the search problem to finding optimal pipeline structures for the available data preprocessors and models. In this work, we propose an extension to GramML that supports larger search spaces including hyperparameter search. We evaluated the approach using an OpenML benchmark and found significant improvements compared to other state-of-the-art techniques.
- [698] arXiv:2404.03425 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: ChangeMamba: Remote Sensing Change Detection with Spatio-Temporal State Space ModelSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Convolutional neural networks (CNN) and Transformers have made impressive progress in the field of remote sensing change detection (CD). However, both architectures have inherent shortcomings. Recently, the Mamba architecture, based on state space models, has shown remarkable performance in a series of natural language processing tasks, which can effectively compensate for the shortcomings of the above two architectures. In this paper, we explore for the first time the potential of the Mamba architecture for remote sensing CD tasks. We tailor the corresponding frameworks, called MambaBCD, MambaSCD, and MambaBDA, for binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA), respectively. All three frameworks adopt the cutting-edge Visual Mamba architecture as the encoder, which allows full learning of global spatial contextual information from the input images. For the change decoder, which is available in all three architectures, we propose three spatio-temporal relationship modeling mechanisms, which can be naturally combined with the Mamba architecture and fully utilize its attribute to achieve spatio-temporal interaction of multi-temporal features, thereby obtaining accurate change information. On five benchmark datasets, our proposed frameworks outperform current CNN- and Transformer-based approaches without using any complex training strategies or tricks, fully demonstrating the potential of the Mamba architecture in CD tasks. Specifically, we obtained 83.11%, 88.39% and 94.19% F1 scores on the three BCD datasets SYSU, LEVIR-CD+, and WHU-CD; on the SCD dataset SECOND, we obtained 24.11% SeK; and on the BDA dataset xBD, we obtained 81.41% overall F1 score. Further experiments show that our architecture is quite robust to degraded data. The source code will be available in this https URL
- [699] arXiv:2404.03474 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Performance of computer vision algorithms for fine-grained classification using crowdsourced insect imagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: With fine-grained classification, we identify unique characteristics to distinguish among classes of the same super-class. We are focusing on species recognition in Insecta, as they are critical for biodiversity monitoring and at the base of many ecosystems. With citizen science campaigns, billions of images are collected in the wild. Once these are labelled, experts can use them to create distribution maps. However, the labelling process is time-consuming, which is where computer vision comes in. The field of computer vision offers a wide range of algorithms, each with its strengths and weaknesses; how do we identify the algorithm that is in line with our application? To answer this question, we provide a full and detailed evaluation of nine algorithms among deep convolutional networks (CNN), vision transformers (ViT), and locality-based vision transformers (LBVT) on 4 different aspects: classification performance, embedding quality, computational cost, and gradient activity. We offer insights that we haven't yet had in this domain proving to which extent these algorithms solve the fine-grained tasks in Insecta. We found that the ViT performs the best on inference speed and computational cost while the LBVT outperforms the others on performance and embedding quality; the CNN provide a trade-off among the metrics.
- [700] arXiv:2404.03491 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Cause-Effect Look at Alleviating Hallucination of Knowledge-grounded Dialogue GenerationComments: Accepted by LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Empowered by the large-scale pretrained language models, existing dialogue systems have demonstrated impressive performance conducting fluent and natural-sounding conversations. However, they are still plagued by the hallucination problem, causing unpredictable factual errors in the generated responses. Recently, knowledge-grounded dialogue generation models, that intentionally invoke external knowledge resources to more informative responses, are also proven to be effective in reducing hallucination. Following the idea of getting high-quality knowledge, a few efforts have achieved pretty good performance on this issue. As some inevitable knowledge noises may also lead to hallucinations, it is emergent to investigate the reason and future directions for building noise-tolerant methods in KGD tasks. In this paper, we analyze the causal story behind this problem with counterfactual reasoning methods. Based on the causal effect analysis, we propose a possible solution for alleviating the hallucination in KGD by exploiting the dialogue-knowledge interaction. Experimental results of our example implementation show that this method can reduce hallucination without disrupting other dialogue performance, while keeping adaptive to different generation models. We hope our efforts can support and call for more attention to developing lightweight techniques towards robust and trusty dialogue systems.
- [701] arXiv:2404.03493 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: A Methodology to Study the Impact of Spiking Neural Network Parameters considering Event-Based Automotive DataComments: 7 pages, 13 figures, 1 tableSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: Autonomous Driving (AD) systems are considered as the future of human mobility and transportation. Solving computer vision tasks such as image classification and object detection/segmentation, with high accuracy and low power/energy consumption, is highly needed to realize AD systems in real life. These requirements can potentially be satisfied by Spiking Neural Networks (SNNs). However, the state-of-the-art works in SNN-based AD systems still focus on proposing network models that can achieve high accuracy, and they have not systematically studied the roles of SNN parameters when used for learning event-based automotive data. Therefore, we still lack understanding of how to effectively develop SNN models for AD systems. Toward this, we propose a novel methodology to systematically study and analyze the impact of SNN parameters considering event-based automotive data, then leverage this analysis for enhancing SNN developments. To do this, we first explore different settings of SNN parameters that directly affect the learning mechanism (i.e., batch size, learning rate, neuron threshold potential, and weight decay), then analyze the accuracy results. Afterward, we propose techniques that jointly improve SNN accuracy and reduce training time. Experimental results show that our methodology can improve the SNN models for AD systems than the state-of-the-art, as it achieves higher accuracy (i.e., 86%) for the NCARS dataset, and it can also achieve iso-accuracy (i.e., ~85% with standard deviation less than 0.5%) while speeding up the training time by 1.9x. In this manner, our research work provides a set of guidelines for SNN parameter enhancements, thereby enabling the practical developments of SNN-based AD systems.
- [702] arXiv:2404.03514 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Learn When (not) to Trust Language Models: A Privacy-Centric Adaptive Model-Aware ApproachSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Retrieval-augmented large language models (LLMs) have been remarkably competent in various NLP tasks. Despite their great success, the knowledge provided by the retrieval process is not always useful for improving the model prediction, since in some samples LLMs may already be quite knowledgeable and thus be able to answer the question correctly without retrieval. Aiming to save the cost of retrieval, previous work has proposed to determine when to do/skip the retrieval in a data-aware manner by analyzing the LLMs' pretraining data. However, these data-aware methods pose privacy risks and memory limitations, especially when requiring access to sensitive or extensive pretraining data. Moreover, these methods offer limited adaptability under fine-tuning or continual learning settings. We hypothesize that token embeddings are able to capture the model's intrinsic knowledge, which offers a safer and more straightforward way to judge the need for retrieval without the privacy risks associated with accessing pre-training data. Moreover, it alleviates the need to retain all the data utilized during model pre-training, necessitating only the upkeep of the token embeddings. Extensive experiments and in-depth analyses demonstrate the superiority of our model-aware approach.
- [703] arXiv:2404.03543 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: CodeEditorBench: Evaluating Code Editing Capability of Large Language ModelsJiawei Guo , Ziming Li , Xueling Liu , Kaijing Ma , Tianyu Zheng , Zhouliang Yu , Ding Pan , Yizhi LI , Ruibo Liu , Yue Wang , Shuyue Guo , Xingwei Qu , Xiang Yue , Ge Zhang , Wenhu Chen , Jie FuSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluation of 19 LLMs reveals that closed-source models (particularly Gemini-Ultra and GPT-4), outperform open-source models in CodeEditorBench, highlighting differences in model performance based on problem types and prompt sensitivities. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners.
- [704] arXiv:2404.03549 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Alzheimer's disease detection in PSG signalsLorena Gallego-Viñarás (1), Juan Miguel Mira-Tomás (1), Anna Michela-Gaeta (3), Gerard Pinol-Ripoll (4), Ferrán Barbé (4 and 5), Pablo M. Olmos (2 and 6), Arrate Muñoz-Barrutia (1 and 2) ((1) Bioengineering Department, Universidad Carlos III de Madrid, (2) Instituto de Investigación Sanitaria Gregorio Marañón (IiSGM), (3) Department of Pulmunology, Hospital Universitario Severo Ochoa, (4) Hospital Universitari Santa Maria Lleida, (5) Hospital Universitari Arnau de Vilanova, (6) Signal Processing Group (GTS), Universidad Carlos III de Madrid)Comments: 12 pages, 14 figures. Submitted to IEEE Biomedical and Health Informatics for publicationSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI)
Abstract: Alzheimer's disease (AD) and sleep disorders exhibit a close association, where disruptions in sleep patterns often precede the onset of Mild Cognitive Impairment (MCI) and early-stage AD. This study delves into the potential of utilizing sleep-related electroencephalography (EEG) signals acquired through polysomnography (PSG) for the early detection of AD. Our primary focus is on exploring semi-supervised Deep Learning techniques for the classification of EEG signals due to the clinical scenario characterized by the limited data availability. The methodology entails testing and comparing the performance of semi-supervised SMATE and TapNet models, benchmarked against the supervised XCM model, and unsupervised Hidden Markov Models (HMMs). The study highlights the significance of spatial and temporal analysis capabilities, conducting independent analyses of each sleep stage. Results demonstrate the effectiveness of SMATE in leveraging limited labeled data, achieving stable metrics across all sleep stages, and reaching 90% accuracy in its supervised form. Comparative analyses reveal SMATE's superior performance over TapNet and HMM, while XCM excels in supervised scenarios with an accuracy range of 92 - 94%. These findings underscore the potential of semi-supervised models in early AD detection, particularly in overcoming the challenges associated with the scarcity of labeled data. Ablation tests affirm the critical role of spatio-temporal feature extraction in semi-supervised predictive performance, and t-SNE visualizations validate the model's proficiency in distinguishing AD patterns. Overall, this research contributes to the advancement of AD detection through innovative Deep Learning approaches, highlighting the crucial role of semi-supervised learning in addressing data limitations.
- [705] arXiv:2404.03574 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: TinyVQA: Compact Multimodal Deep Neural Network for Visual Question Answering on Resource-Constrained DevicesComments: Accepted as a full paper by the tinyML Research Symposium 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Traditional machine learning models often require powerful hardware, making them unsuitable for deployment on resource-limited devices. Tiny Machine Learning (tinyML) has emerged as a promising approach for running machine learning models on these devices, but integrating multiple data modalities into tinyML models still remains a challenge due to increased complexity, latency, and power consumption. This paper proposes TinyVQA, a novel multimodal deep neural network for visual question answering tasks that can be deployed on resource-constrained tinyML hardware. TinyVQA leverages a supervised attention-based model to learn how to answer questions about images using both vision and language modalities. Distilled knowledge from the supervised attention-based VQA model trains the memory aware compact TinyVQA model and low bit-width quantization technique is employed to further compress the model for deployment on tinyML devices. The TinyVQA model was evaluated on the FloodNet dataset, which is used for post-disaster damage assessment. The compact model achieved an accuracy of 79.5%, demonstrating the effectiveness of TinyVQA for real-world applications. Additionally, the model was deployed on a Crazyflie 2.0 drone, equipped with an AI deck and GAP8 microprocessor. The TinyVQA model achieved low latencies of 56 ms and consumes 693 mW power while deployed on the tiny drone, showcasing its suitability for resource-constrained embedded systems.
- [706] arXiv:2404.03587 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Anticipate & Collab: Data-driven Task Anticipation and Knowledge-driven Planning for Human-robot CollaborationShivam Singh , Karthik Swaminathan , Raghav Arora , Ramandeep Singh , Ahana Datta , Dipanjan Das , Snehasis Banerjee , Mohan Sridharan , Madhava KrishnaSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: An agent assisting humans in daily living activities can collaborate more effectively by anticipating upcoming tasks. Data-driven methods represent the state of the art in task anticipation, planning, and related problems, but these methods are resource-hungry and opaque. Our prior work introduced a proof of concept framework that used an LLM to anticipate 3 high-level tasks that served as goals for a classical planning system that computed a sequence of low-level actions for the agent to achieve these goals. This paper describes DaTAPlan, our framework that significantly extends our prior work toward human-robot collaboration. Specifically, DaTAPlan planner computes actions for an agent and a human to collaboratively and jointly achieve the tasks anticipated by the LLM, and the agent automatically adapts to unexpected changes in human action outcomes and preferences. We evaluate DaTAPlan capabilities in a realistic simulation environment, demonstrating accurate task anticipation, effective human-robot collaboration, and the ability to adapt to unexpected changes. Project website: this https URL
- [707] arXiv:2404.03590 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SemGrasp: Semantic Grasp Generation via Language Aligned DiscretizationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Generating natural human grasps necessitates consideration of not just object geometry but also semantic information. Solely depending on object shape for grasp generation confines the applications of prior methods in downstream tasks. This paper presents a novel semantic-based grasp generation method, termed SemGrasp, which generates a static human grasp pose by incorporating semantic information into the grasp representation. We introduce a discrete representation that aligns the grasp space with semantic space, enabling the generation of grasp postures in accordance with language instructions. A Multimodal Large Language Model (MLLM) is subsequently fine-tuned, integrating object, grasp, and language within a unified semantic space. To facilitate the training of SemGrasp, we have compiled a large-scale, grasp-text-aligned dataset named CapGrasp, featuring about 260k detailed captions and 50k diverse grasps. Experimental findings demonstrate that SemGrasp efficiently generates natural human grasps in alignment with linguistic intentions. Our code, models, and dataset are available publicly at: this https URL .
- [708] arXiv:2404.03592 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: ReFT: Representation Finetuning for Language ModelsZhengxuan Wu , Aryaman Arora , Zheng Wang , Atticus Geiger , Dan Jurafsky , Christopher D. Manning , Christopher PottsComments: 40 pages, preprintSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Parameter-efficient fine-tuning (PEFT) methods seek to adapt large models via updates to a small number of weights. However, much prior interpretability work has shown that representations encode rich semantic information, suggesting that editing representations might be a more powerful alternative. Here, we pursue this hypothesis by developing a family of $\textbf{Representation Finetuning (ReFT)}$ methods. ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations. We define a strong instance of the ReFT family, Low-rank Linear Subspace ReFT (LoReFT). LoReFT is a drop-in replacement for existing PEFTs and learns interventions that are 10x-50x more parameter-efficient than prior state-of-the-art PEFTs. We showcase LoReFT on eight commonsense reasoning tasks, four arithmetic reasoning tasks, Alpaca-Eval v1.0, and GLUE. In all these evaluations, LoReFT delivers the best balance of efficiency and performance, and almost always outperforms state-of-the-art PEFTs. We release a generic ReFT training library publicly at this https URL .
- [709] arXiv:2404.03596 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Laser Learning Environment: A new environment for coordination-critical multi-agent tasksComments: Pre-print, 21 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: We introduce the Laser Learning Environment (LLE), a collaborative multi-agent reinforcement learning environment in which coordination is central. In LLE, agents depend on each other to make progress (interdependence), must jointly take specific sequences of actions to succeed (perfect coordination), and accomplishing those joint actions does not yield any intermediate reward (zero-incentive dynamics). The challenge of such problems lies in the difficulty of escaping state space bottlenecks caused by interdependence steps since escaping those bottlenecks is not rewarded. We test multiple state-of-the-art value-based MARL algorithms against LLE and show that they consistently fail at the collaborative task because of their inability to escape state space bottlenecks, even though they successfully achieve perfect coordination. We show that Q-learning extensions such as prioritized experience replay and n-steps return hinder exploration in environments with zero-incentive dynamics, and find that intrinsic curiosity with random network distillation is not sufficient to escape those bottlenecks. We demonstrate the need for novel methods to solve this problem and the relevance of LLE as cooperative MARL benchmark.
- [710] arXiv:2404.03606 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Analyzing Musical Characteristics of National Anthems in Relation to Global IndicesS M Rakib Hasan , Aakar Dhakal , Ms. Ayesha Siddiqua , Mohammad Mominur Rahman , Md Maidul Islam , Mohammed Arfat Raihan Chowdhury , S M Masfequier Rahman Swapno , SM Nuruzzaman NobelSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
Abstract: Music plays a huge part in shaping peoples' psychology and behavioral patterns. This paper investigates the connection between national anthems and different global indices with computational music analysis and statistical correlation analysis. We analyze national anthem musical data to determine whether certain musical characteristics are associated with peace, happiness, suicide rate, crime rate, etc. To achieve this, we collect national anthems from 169 countries and use computational music analysis techniques to extract pitch, tempo, beat, and other pertinent audio features. We then compare these musical characteristics with data on different global indices to ascertain whether a significant correlation exists. Our findings indicate that there may be a correlation between the musical characteristics of national anthems and the indices we investigated. The implications of our findings for music psychology and policymakers interested in promoting social well-being are discussed. This paper emphasizes the potential of musical data analysis in social research and offers a novel perspective on the relationship between music and social indices. The source code and data are made open-access for reproducibility and future research endeavors. It can be accessed at this http URL .
- [711] arXiv:2404.03608 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Sailor: Open Language Models for South-East AsiaComments: Code is available at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We present Sailor, a family of open language models ranging from 0.5B to 7B parameters, tailored for South-East Asian (SEA) languages. These models are continually pre-trained from Qwen1.5, a great language model for multilingual use cases. From Qwen1.5, Sailor models accept 200B to 400B tokens, primarily covering the languages of English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training leverages several techniques, including BPE dropout for improving the model robustness, aggressive data cleaning and deduplication, and small proxy models to optimize data mixture. Experimental results on four typical tasks indicate that Sailor models demonstrate strong performance across different benchmarks, including commonsense reasoning, question answering, reading comprehension and examination. Embracing the open-source spirit, we share our insights through this report to spark a wider interest in developing large language models for multilingual use cases.
- [712] arXiv:2404.03611 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: InsectMamba: Insect Pest Classification with State Space ModelComments: 13 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The classification of insect pests is a critical task in agricultural technology, vital for ensuring food security and environmental sustainability. However, the complexity of pest identification, due to factors like high camouflage and species diversity, poses significant obstacles. Existing methods struggle with the fine-grained feature extraction needed to distinguish between closely related pest species. Although recent advancements have utilized modified network structures and combined deep learning approaches to improve accuracy, challenges persist due to the similarity between pests and their surroundings. To address this problem, we introduce InsectMamba, a novel approach that integrates State Space Models (SSMs), Convolutional Neural Networks (CNNs), Multi-Head Self-Attention mechanism (MSA), and Multilayer Perceptrons (MLPs) within Mix-SSM blocks. This integration facilitates the extraction of comprehensive visual features by leveraging the strengths of each encoding strategy. A selective module is also proposed to adaptively aggregate these features, enhancing the model's ability to discern pest characteristics. InsectMamba was evaluated against strong competitors across five insect pest classification datasets. The results demonstrate its superior performance and verify the significance of each model component by an ablation study.
- [713] arXiv:2404.03623 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Unveiling LLMs: The Evolution of Latent Representations in a Temporal Knowledge GraphComments: Preprint. Under review. 10 pages, 7 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Large Language Models (LLMs) demonstrate an impressive capacity to recall a vast range of common factual knowledge information. However, unravelling the underlying reasoning of LLMs and explaining their internal mechanisms of exploiting this factual knowledge remain active areas of investigation. Our work analyzes the factual knowledge encoded in the latent representation of LLMs when prompted to assess the truthfulness of factual claims. We propose an end-to-end framework that jointly decodes the factual knowledge embedded in the latent space of LLMs from a vector space to a set of ground predicates and represents its evolution across the layers using a temporal knowledge graph. Our framework relies on the technique of activation patching which intervenes in the inference computation of a model by dynamically altering its latent representations. Consequently, we neither rely on external models nor training processes. We showcase our framework with local and global interpretability analyses using two claim verification datasets: FEVER and CLIMATE-FEVER. The local interpretability analysis exposes different latent errors from representation to multi-hop reasoning errors. On the other hand, the global analysis uncovered patterns in the underlying evolution of the model's factual knowledge (e.g., store-and-seek factual information). By enabling graph-based analyses of the latent representations, this work represents a step towards the mechanistic interpretability of LLMs.
- [714] arXiv:2404.03635 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: WorDepth: Variational Language Prior for Monocular Depth EstimationZiyao Zeng , Daniel Wang , Fengyu Yang , Hyoungseob Park , Yangchao Wu , Stefano Soatto , Byung-Woo Hong , Dong Lao , Alex WongSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Abstract: Three-dimensional (3D) reconstruction from a single image is an ill-posed problem with inherent ambiguities, i.e. scale. Predicting a 3D scene from text description(s) is similarly ill-posed, i.e. spatial arrangements of objects described. We investigate the question of whether two inherently ambiguous modalities can be used in conjunction to produce metric-scaled reconstructions. To test this, we focus on monocular depth estimation, the problem of predicting a dense depth map from a single image, but with an additional text caption describing the scene. To this end, we begin by encoding the text caption as a mean and standard deviation; using a variational framework, we learn the distribution of the plausible metric reconstructions of 3D scenes corresponding to the text captions as a prior. To "select" a specific reconstruction or depth map, we encode the given image through a conditional sampler that samples from the latent space of the variational text encoder, which is then decoded to the output depth map. Our approach is trained alternatingly between the text and image branches: in one optimization step, we predict the mean and standard deviation from the text description and sample from a standard Gaussian, and in the other, we sample using a (image) conditional sampler. Once trained, we directly predict depth from the encoded text using the conditional sampler. We demonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios, where we show that language can consistently improve performance in both.
- [715] arXiv:2404.03647 (cross-list from math.OC) [ pdf , ps , html , other ]
-
Title: Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 UltraDarioush Kevian , Usman Syed , Xingang Guo , Aaron Havens , Geir Dullerud , Peter Seiler , Lianhui Qin , Bin HuSubjects: Optimization and Control (math.OC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra in solving undergraduate-level control problems. Controls provides an interesting case study for LLM reasoning due to its combination of mathematical theory and engineering design. We introduce ControlBench, a benchmark dataset tailored to reflect the breadth, depth, and complexity of classical control design. We use this dataset to study and evaluate the problem-solving abilities of these LLMs in the context of control engineering. We present evaluations conducted by a panel of human experts, providing insights into the accuracy, reasoning, and explanatory prowess of LLMs in control engineering. Our analysis reveals the strengths and limitations of each LLM in the context of classical control, and our results imply that Claude 3 Opus has become the state-of-the-art LLM for solving undergraduate control problems. Our study serves as an initial step towards the broader goal of employing artificial general intelligence in control engineering.
- [716] arXiv:2404.03653 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept MatchingDongzhi Jiang , Guanglu Song , Xiaoshi Wu , Renrui Zhang , Dazhong Shen , Zhuofan Zong , Yu Liu , Hongsheng LiComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Diffusion models have demonstrated great success in the field of text-to-image generation. However, alleviating the misalignment between the text prompts and images is still challenging. The root reason behind the misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which is caused by its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. A novel attribute concentration module is also proposed to address the attribute binding problem. Without any image or human preference data, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL. Extensive experiments show that CoMat-SDXL significantly outperforms the baseline model SDXL in two text-to-image alignment benchmarks and achieves start-of-the-art performance.
- [717] arXiv:2404.03657 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: OW-VISCap: Open-World Video Instance Segmentation and CaptioningComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Open-world video instance segmentation is an important video understanding task. Yet most methods either operate in a closed-world setting, require an additional user-input, or use classic region-based proposals to identify never before seen objects. Further, these methods only assign a one-word label to detected objects, and don't generate rich object-centric descriptions. They also often suffer from highly overlapping predictions. To address these issues, we propose Open-World Video Instance Segmentation and Captioning (OW-VISCap), an approach to jointly segment, track, and caption previously seen or unseen objects in a video. For this, we introduce open-world object queries to discover never before seen objects without additional user-input. We generate rich and descriptive object-centric captions for each detected object via a masked attention augmented LLM input. We introduce an inter-query contrastive loss to ensure that the object queries differ from one another. Our generalized approach matches or surpasses state-of-the-art on three tasks: open-world video instance segmentation on the BURST dataset, dense video object captioning on the VidSTG dataset, and closed-world video instance segmentation on the OVIS dataset.
- [718] arXiv:2404.03662 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: X-lifecycle Learning for Cloud Incident Management using LLMsDrishti Goel , Fiza Husain , Aditya Singh , Supriyo Ghosh , Anjaly Parayil , Chetan Bansal , Xuchao Zhang , Saravan RajmohanSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI)
Abstract: Incident management for large cloud services is a complex and tedious process and requires significant amount of manual efforts from on-call engineers (OCEs). OCEs typically leverage data from different stages of the software development lifecycle [SDLC] (e.g., codes, configuration, monitor data, service properties, service dependencies, trouble-shooting documents, etc.) to generate insights for detection, root causing and mitigating of incidents. Recent advancements in large language models [LLMs] (e.g., ChatGPT, GPT-4, Gemini) created opportunities to automatically generate contextual recommendations to the OCEs assisting them to quickly identify and mitigate critical issues. However, existing research typically takes a silo-ed view for solving a certain task in incident management by leveraging data from a single stage of SDLC. In this paper, we demonstrate that augmenting additional contextual data from different stages of SDLC improves the performance of two critically important and practically challenging tasks: (1) automatically generating root cause recommendations for dependency failure related incidents, and (2) identifying ontology of service monitors used for automatically detecting incidents. By leveraging 353 incident and 260 monitor dataset from Microsoft, we demonstrate that augmenting contextual information from different stages of the SDLC improves the performance over State-of-The-Art methods.
- [719] arXiv:2404.03664 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: LLMs in the Heart of Differential Testing: A Case Study on a Medical Rule EngineComments: 12 pages, 6 figures, 4 tables, 1 listingSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: The Cancer Registry of Norway (CRN) uses an automated cancer registration support system (CaReSS) to support core cancer registry activities, i.e, data capture, data curation, and producing data products and statistics for various stakeholders. GURI is a core component of CaReSS, which is responsible for validating incoming data with medical rules. Such medical rules are manually implemented by medical experts based on medical standards, regulations, and research. Since large language models (LLMs) have been trained on a large amount of public information, including these documents, they can be employed to generate tests for GURI. Thus, we propose an LLM-based test generation and differential testing approach (LLMeDiff) to test GURI. We experimented with four different LLMs, two medical rule engine implementations, and 58 real medical rules to investigate the hallucination, success, time efficiency, and robustness of the LLMs to generate tests, and these tests' ability to find potential issues in GURI. Our results showed that GPT-3.5 hallucinates the least, is the most successful, and is generally the most robust; however, it has the worst time efficiency. Our differential testing revealed 22 medical rules where implementation inconsistencies were discovered (e.g., regarding handling rule versions). Finally, we provide insights for practitioners and researchers based on the results.
- [720] arXiv:2404.03665 (cross-list from cs.DC) [ pdf , ps , other ]
-
Title: Serial Parallel Reliability Redundancy Allocation Optimization for Energy Efficient and Fault Tolerant Cloud ComputingComments: 5 Pages, 1 Figure, 2 TablesSubjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI)
Abstract: Serial-parallel redundancy is a reliable way to ensure service and systems will be available in cloud computing. That method involves making copies of the same system or program, with only one remaining active. When an error occurs, the inactive copy can step in as a backup right away, this provides continuous performance and uninterrupted operation. This approach is called parallel redundancy, otherwise known as active-active redundancy, and its exceptional when it comes to strategy. It creates duplicates of a system or service that are all running at once. By doing this fault tolerance increases since if one copy fails, the workload can be distributed across any replica thats functioning properly. Reliability allocation depends on features in a system and the availability and fault tolerance you want from it. Serial redundancy or parallel redundancies can be applied to increase the dependability of systems and services. To demonstrate how well this concept works, we looked into fixed serial parallel reliability redundancy allocation issues followed by using an innovative hybrid optimization technique to find the best possible allocation for peak dependability. We then measured our findings against other research.
- [721] arXiv:2404.03673 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: RL for Consistency Models: Faster Reward Guided Text-to-Image GenerationComments: 17 pages, 8 figures, 1 tableSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Reinforcement learning (RL) has improved guided image generation with diffusion models by directly optimizing rewards that capture image quality, aesthetics, and instruction following capabilities. However, the resulting generative policies inherit the same iterative sampling process of diffusion models that causes slow generation. To overcome this limitation, consistency models proposed learning a new class of generative models that directly map noise to data, resulting in a model that can generate an image in as few as one sampling iteration. In this work, to optimize text-to-image generative models for task specific rewards and enable fast training and inference, we propose a framework for fine-tuning consistency models via RL. Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure. RLCM improves upon RL fine-tuned diffusion models on text-to-image generation capabilities and trades computation during inference time for sample quality. Experimentally, we show that RLCM can adapt text-to-image consistency models to objectives that are challenging to express with prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Comparing to RL finetuned diffusion models, RLCM trains significantly faster, improves the quality of the generation measured under the reward objectives, and speeds up the inference procedure by generating high quality images with as few as two inference steps. Our code is available at this https URL
- [722] arXiv:2404.03683 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Stream of Search (SoS): Learning to Search in LanguageKanishk Gandhi , Denise Lee , Gabriel Grand , Muxin Liu , Winson Cheng , Archit Sharma , Noah D. GoodmanSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Language models are rarely shown fruitful mistakes while training. They then struggle to look beyond the next token, suffering from a snowballing of errors and struggling to predict the consequence of their actions several steps ahead. In this paper, we show how language models can be taught to search by representing the process of search in language, as a flattened string -- a stream of search (SoS). We propose a unified language for search that captures an array of different symbolic search strategies. We demonstrate our approach using the simple yet difficult game of Countdown, where the goal is to combine input numbers with arithmetic operations to reach a target number. We pretrain a transformer-based language model from scratch on a dataset of streams of search generated by heuristic solvers. We find that SoS pretraining increases search accuracy by 25% over models trained to predict only the optimal search trajectory. We further finetune this model with two policy improvement methods: Advantage-Induced Policy Alignment (APA) and Self-Taught Reasoner (STaR). The finetuned SoS models solve 36% of previously unsolved problems, including problems that cannot be solved by any of the heuristic solvers. Our results indicate that language models can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
- [723] arXiv:2404.03685 (cross-list from physics.soc-ph) [ pdf , ps , other ]
-
Title: Cooperative Evolutionary Pressure and Diminishing Returns Might Explain the Fermi Paradox: On What Super-AIs Are LikeComments: 23 pages, 1 figure. Added acknowledgement, clarifications, referencesSubjects: Physics and Society (physics.soc-ph) ; Artificial Intelligence (cs.AI)
Abstract: With an evolutionary approach, the basis of morality can be explained as adaptations to problems of cooperation. With 'evolution' taken in a broad sense, evolving AIs that satisfy the conditions for evolution to apply will be subject to the same cooperative evolutionary pressure as biological entities. Here the adaptiveness of increased cooperation as material safety and wealth increase is discussed -- for humans, for other societies, and for AIs. Diminishing beneficial returns from increased access to material resources also suggests the possibility that, on the whole, there will be no incentive to for instance colonize entire galaxies, thus providing a possible explanation of the Fermi paradox, wondering where everybody is. It is further argued that old societies could engender, give way to, super-AIs, since it is likely that super-AIs are feasible, and fitter. Closing is an aside on effective ways for morals and goals to affect life and society, emphasizing environments, cultures, and laws, and exemplified by how to eat.
Appended are an algorithm for colonizing for example a galaxy quickly, models of the evolution of cooperation and fairness under diminishing returns, and software for simulating signaling development. It is also noted that there can be no exponential colonization or reproduction, for mathematical reasons, as each entity takes up a certain amount of space. - [724] arXiv:2404.03686 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Securing Social Spaces: Harnessing Deep Learning to Eradicate CyberbullyingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: In today's digital world, cyberbullying is a serious problem that can harm the mental and physical health of people who use social media. This paper explains just how serious cyberbullying is and how it really affects indi-viduals exposed to it. It also stresses how important it is to find better ways to detect cyberbullying so that online spaces can be safer. Plus, it talks about how making more accurate tools to spot cyberbullying will be really helpful in the future. Our paper introduces a deep learning-based ap-proach, primarily employing BERT and BiLSTM architectures, to effective-ly address cyberbullying. This approach is designed to analyse large vol-umes of posts and predict potential instances of cyberbullying in online spaces. Our results demonstrate the superiority of the hateBERT model, an extension of BERT focused on hate speech detection, among the five mod-els, achieving an accuracy rate of 89.16%. This research is a significant con-tribution to "Computational Intelligence for Social Transformation," prom-ising a safer and more inclusive digital landscape.
- [725] arXiv:2404.03693 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Improve Knowledge Distillation via Label Revision and Data SelectionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Knowledge distillation (KD) has become a widely used technique in the field of model compression, which aims to transfer knowledge from a large teacher model to a lightweight student model for efficient network development. In addition to the supervision of ground truth, the vanilla KD method regards the predictions of the teacher as soft labels to supervise the training of the student model. Based on vanilla KD, various approaches have been developed to further improve the performance of the student model. However, few of these previous methods have considered the reliability of the supervision from teacher models. Supervision from erroneous predictions may mislead the training of the student model. This paper therefore proposes to tackle this problem from two aspects: Label Revision to rectify the incorrect supervision and Data Selection to select appropriate samples for distillation to reduce the impact of erroneous supervision. In the former, we propose to rectify the teacher's inaccurate predictions using the ground truth. In the latter, we introduce a data selection technique to choose suitable training samples to be supervised by the teacher, thereby reducing the impact of incorrect predictions to some extent. Experiment results demonstrate the effectiveness of our proposed method, and show that our method can be combined with other distillation approaches, improving their performance.
- [726] arXiv:2404.03702 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Personalized Federated Learning for Spatio-Temporal Forecasting: A Dual Semantic Alignment-Based Contrastive ApproachSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The existing federated learning (FL) methods for spatio-temporal forecasting fail to capture the inherent spatio-temporal heterogeneity, which calls for personalized FL (PFL) methods to model the spatio-temporally variant patterns. While contrastive learning approach is promising in addressing spatio-temporal heterogeneity, the existing methods are noneffective in determining negative pairs and can hardly apply to PFL paradigm. To tackle this limitation, we propose a novel PFL method, named Federated dUal sEmantic aLignment-based contraStive learning (FUELS), which can adaptively align positive and negative pairs based on semantic similarity, thereby injecting precise spatio-temporal heterogeneity into the latent representation space by auxiliary contrastive tasks. From temporal perspective, a hard negative filtering module is introduced to dynamically align heterogeneous temporal representations for the supplemented intra-client contrastive task. From spatial perspective, we design lightweight-but-efficient prototypes as client-level semantic representations, based on which the server evaluates spatial similarity and yields client-customized global prototypes for the supplemented inter-client contrastive task. Extensive experiments demonstrate that FUELS outperforms state-of-the-art methods, with communication cost decreasing by around 94%.
- [727] arXiv:2404.03703 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Mitigating analytical variability in fMRI results with style transferSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: We propose a novel approach to improve the reproducibility of neuroimaging results by converting statistic maps across different functional MRI pipelines. We make the assumption that pipelines can be considered as a style component of data and propose to use different generative models, among which, Diffusion Models (DM) to convert data between pipelines. We design a new DM-based unsupervised multi-domain image-to-image transition framework and constrain the generation of 3D fMRI statistic maps using the latent space of an auxiliary classifier that distinguishes statistic maps from different pipelines. We extend traditional sampling techniques used in DM to improve the transition performance. Our experiments demonstrate that our proposed methods are successful: pipelines can indeed be transferred, providing an important source of data augmentation for future medical studies.
- [728] arXiv:2404.03704 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Improvement of Performance in Freezing of Gait detection in Parkinsons Disease using Transformer networks and a single waist worn triaxial accelerometerLuis Sigcha , Luigi Borzì , Ignacio Pavón , Nélson Costa , Susana Costa , Pedro Arezes , Juan-Manuel López , Guillermo De ArcasJournal-ref: Engineering Applications of Artificial Intelligence Volume 116, November 2022, 105482Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: Freezing of gait (FOG) is one of the most incapacitating symptoms in Parkinsons disease, affecting more than 50 percent of patients in advanced stages of the disease. The presence of FOG may lead to falls and a loss of independence with a consequent reduction in the quality of life. Wearable technology and artificial intelligence have been used for automatic FOG detection to optimize monitoring. However, differences between laboratory and daily-life conditions present challenges for the implementation of reliable detection systems. Consequently, improvement of FOG detection methods remains important to provide accurate monitoring mechanisms intended for free-living and real-time use. This paper presents advances in automatic FOG detection using a single body-worn triaxial accelerometer and a novel classification algorithm based on Transformers and convolutional networks. This study was performed with data from 21 patients who manifested FOG episodes while performing activities of daily living in a home setting. Results indicate that the proposed FOG-Transformer can bring a significant improvement in FOG detection using leave-one-subject-out cross-validation (LOSO CV). These results bring opportunities for the implementation of accurate monitoring systems for use in ambulatory or home settings.
- [729] arXiv:2404.03707 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Investigating the Robustness of Counterfactual Learning to Rank Models: A Reproducibility StudySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Counterfactual learning to rank (CLTR) has attracted extensive attention in the IR community for its ability to leverage massive logged user interaction data to train ranking models. While the CLTR models can be theoretically unbiased when the user behavior assumption is correct and the propensity estimation is accurate, their effectiveness is usually empirically evaluated via simulation-based experiments due to a lack of widely-available, large-scale, real click logs. However, the mainstream simulation-based experiments are somewhat limited as they often feature a single, deterministic production ranker and simplified user simulation models to generate the synthetic click logs. As a result, the robustness of CLTR models in complex and diverse situations is largely unknown and needs further investigation.
To address this problem, in this paper, we aim to investigate the robustness of existing CLTR models in a reproducibility study with extensive simulation-based experiments that (1) use both deterministic and stochastic production rankers, each with different ranking performance, and (2) leverage multiple user simulation models with different user behavior assumptions. We find that the DLA models and IPS-DCM show better robustness under various simulation settings than IPS-PBM and PRS with offline propensity estimation. Besides, the existing CLTR models often fail to outperform the naive click baselines when the production ranker has relatively high ranking performance or certain randomness, which suggests an urgent need for developing new CLTR algorithms that work for these settings. - [730] arXiv:2404.03709 (cross-list from cs.LO) [ pdf , ps , other ]
-
Title: Proceedings 12th International Workshop on Theorem proving components for Educational softwareJulien Narboux (University of Strasbourg, France), Walther Neuper (JKU, Johannes Kepler University), Pedro Quaresma (University of Coimbra, Portugal)Journal-ref: EPTCS 400, 2024Subjects: Logic in Computer Science (cs.LO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The ThEdu series pursues the smooth transition from an intuitive way of doing mathematics at secondary school to a more formal approach to the subject in STEM education, while favouring software support for this transition by exploiting the power of theorem-proving technologies. What follows is a brief description of how the present volume contributes to this enterprise.
The 12th International Workshop on Theorem Proving Components for Educational Software(ThEdu'23), was a satellite event of the 29th international Conference on Automated Deduction (CADE 2023), July 1-4, 2023, Rome, Italy. ThEdu'23 was very successful, with one invited talk, by Yves Bertot (Inria, France), "The challenges of using Type Theory to teach Mathematics", and seven regular contributions. An open call for papers was then issued, to which eight contributions were submitted. Seven submissions have been accepted by our reviewers, who jointly produced at least three careful reports on each of the contributions. The resulting revised papers are collected in the present volume.
We, the volume editors, hope that this collection of papers will further promote the development of theorem-proving based software, and that it will allow to improve the mutual understanding between computer scientists, mathematicians and stakeholders in education.
PC Chairs:Julien Narboux (University of Strasbourg, France); Walther Neuper (JKU, Johannes Kepler University, Linz, Austria); Pedro Quaresma (University of Coimbra, Portugal) - [731] arXiv:2404.03710 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Self-organized arrival system for urban air mobilitySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Urban air mobility is an innovative mode of transportation in which electric vertical takeoff and landing (eVTOL) vehicles operate between nodes called vertiports. We outline a self-organized vertiport arrival system based on deep reinforcement learning. The airspace around the vertiport is assumed to be circular, and the vehicles can freely operate inside. Each aircraft is considered an individual agent and follows a shared policy, resulting in decentralized actions that are based on local information. We investigate the development of the reinforcement learning policy during training and illustrate how the algorithm moves from suboptimal local holding patterns to a safe and efficient final policy. The latter is validated in simulation-based scenarios and also deployed on small-scale unmanned aerial vehicles to showcase its real-world usability.
- [732] arXiv:2404.03713 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Explaining Explainability: Understanding Concept Activation VectorsComments: (54 pages, 39 figures)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Abstract: Recent interpretability methods propose using concept-based explanations to translate the internal representations of deep learning models into a language that humans are familiar with: concepts. This requires understanding which concepts are present in the representation space of a neural network. One popular method for finding concepts is Concept Activation Vectors (CAVs), which are learnt using a probe dataset of concept exemplars. In this work, we investigate three properties of CAVs. CAVs may be: (1) inconsistent between layers, (2) entangled with different concepts, and (3) spatially dependent. Each property provides both challenges and opportunities in interpreting models. We introduce tools designed to detect the presence of these properties, provide insight into how they affect the derived explanations, and provide recommendations to minimise their impact. Understanding these properties can be used to our advantage. For example, we introduce spatially dependent CAVs to test if a model is translation invariant with respect to a specific concept and class. Our experiments are performed on ImageNet and a new synthetic dataset, Elements. Elements is designed to capture a known ground truth relationship between concepts and classes. We release this dataset to facilitate further research in understanding and evaluating interpretability methods.
- [733] arXiv:2404.03714 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: SpikeExplorer: hardware-oriented Design Space Exploration for Spiking Neural Networks on FPGASubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: One of today's main concerns is to bring Artificial Intelligence power to embedded systems for edge applications. The hardware resources and power consumption required by state-of-the-art models are incompatible with the constrained environments observed in edge systems, such as IoT nodes and wearable devices. Spiking Neural Networks (SNNs) can represent a solution in this sense: inspired by neuroscience, they reach unparalleled power and resource efficiency when run on dedicated hardware accelerators. However, when designing such accelerators, the amount of choices that can be taken is huge. This paper presents SpikExplorer, a modular and flexible Python tool for hardware-oriented Automatic Design Space Exploration to automate the configuration of FPGA accelerators for SNNs. Using Bayesian optimizations, SpikerExplorer enables hardware-centric multi-objective optimization, supporting factors such as accuracy, area, latency, power, and various combinations during the exploration process. The tool searches the optimal network architecture, neuron model, and internal and training parameters, trying to reach the desired constraints imposed by the user. It allows for a straightforward network configuration, providing the full set of explored points for the user to pick the trade-off that best fits the needs. The potential of SpikExplorer is showcased using three benchmark datasets. It reaches 95.8% accuracy on the MNIST dataset, with a power consumption of 180mW/image and a latency of 0.12 ms/image, making it a powerful tool for automatically optimizing SNNs.
- [734] arXiv:2404.03715 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Direct Nash Optimization: Teaching Language Models to Self-Improve with General PreferencesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning and subsequent policy optimization. However, such a reward maximization approach is limited by the nature of "point-wise" rewards (such as Bradley-Terry model), which fails to express complex intransitive or cyclic preference relations. While advances on RLHF show reward learning and policy optimization can be merged into a single contrastive objective for stability, they yet still remain tethered to the reward maximization framework. Recently, a new wave of research sidesteps the reward maximization presumptions in favor of directly optimizing over "pair-wise" or general preferences. In this paper, we introduce Direct Nash Optimization (DNO), a provable and scalable algorithm that marries the simplicity and stability of contrastive learning with theoretical generality from optimizing general preferences. Because DNO is a batched on-policy algorithm using a regression-based objective, its implementation is straightforward and efficient. Moreover, DNO enjoys monotonic improvement across iterations that help it improve even over a strong teacher (such as GPT-4). In our experiments, a resulting 7B parameter Orca-2.5 model aligned by DNO achieves the state-of-the-art win-rate against GPT-4-Turbo of 33% on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26% (7% to 33%) over the initializing model. It outperforms models with far more parameters, including Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4.
- [735] arXiv:2404.03732 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination DetectionComments: 6 pages, 6 figures, 4 tables, camera-ready copy, accepted to the 18th International Workshop on Semantic Evaluation (SemEval-2024), for associated code and data see this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We describe the University of Amsterdam Intelligent Data Engineering Lab team's entry for the SemEval-2024 Task 6 competition. The SHROOM-INDElab system builds on previous work on using prompt programming and in-context learning with large language models (LLMs) to build classifiers for hallucination detection, and extends that work through the incorporation of context-specific definition of task, role, and target concept, and automated generation of examples for use in a few-shot prompting approach. The resulting system achieved fourth-best and sixth-best performance in the model-agnostic track and model-aware tracks for Task 6, respectively, and evaluation using the validation sets showed that the system's classification decisions were consistent with those of the crowd-sourced human labellers. We further found that a zero-shot approach provided better accuracy than a few-shot approach using automatically generated examples. Code for the system described in this paper is available on Github.
- [736] arXiv:2404.03745 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Fakes of Varying Shades: How Warning Affects Human Perception and Engagement Regarding LLM HallucinationsSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The widespread adoption and transformative effects of large language models (LLMs) have sparked concerns regarding their capacity to produce inaccurate and fictitious content, referred to as `hallucinations'. Given the potential risks associated with hallucinations, humans should be able to identify them. This research aims to understand the human perception of LLM hallucinations by systematically varying the degree of hallucination (genuine, minor hallucination, major hallucination) and examining its interaction with warning (i.e., a warning of potential inaccuracies: absent vs. present). Participants (N=419) from Prolific rated the perceived accuracy and engaged with content (e.g., like, dislike, share) in a Q/A format. Results indicate that humans rank content as truthful in the order genuine > minor hallucination > major hallucination and user engagement behaviors mirror this pattern. More importantly, we observed that warning improves hallucination detection without significantly affecting the perceived truthfulness of genuine content. We conclude by offering insights for future tools to aid human detection of hallucinations.
- [737] arXiv:2404.03746 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: GenQREnsemble: Zero-Shot LLM Ensemble Prompting for Generative Query ReformulationComments: Accepted at ECIR 2024Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Query Reformulation(QR) is a set of techniques used to transform a user's original search query to a text that better aligns with the user's intent and improves their search experience. Recently, zero-shot QR has been shown to be a promising approach due to its ability to exploit knowledge inherent in large language models. By taking inspiration from the success of ensemble prompting strategies which have benefited many tasks, we investigate if they can help improve query reformulation. In this context, we propose an ensemble based prompting technique, GenQREnsemble which leverages paraphrases of a zero-shot instruction to generate multiple sets of keywords ultimately improving retrieval performance. We further introduce its post-retrieval variant, GenQREnsembleRF to incorporate pseudo relevant feedback. On evaluations over four IR benchmarks, we find that GenQREnsemble generates better reformulations with relative nDCG@10 improvements up to 18% and MAP improvements upto 24% over the previous zero-shot state-of-art. On the MSMarco Passage Ranking task, GenQREnsembleRF shows relative gains of 5% MRR using pseudo-relevance feedback, and 9% nDCG@10 using relevant feedback documents.
- [738] arXiv:2404.03753 (cross-list from cs.LO) [ pdf , ps , html , other ]
-
Title: A Reinforcement Learning based Reset Policy for CDCL SAT SolversSubjects: Logic in Computer Science (cs.LO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Restart policy is an important technique used in modern Conflict-Driven Clause Learning (CDCL) solvers, wherein some parts of the solver state are erased at certain intervals during the run of the solver. In most solvers, variable activities are preserved across restart boundaries, resulting in solvers continuing to search parts of the assignment tree that are not far from the one immediately prior to a restart. To enable the solver to search possibly "distant" parts of the assignment tree, we study the effect of resets, a variant of restarts which not only erases the assignment trail, but also randomizes the activity scores of the variables of the input formula after reset, thus potentially enabling a better global exploration of the search space.
In this paper, we model the problem of whether to trigger reset as a multi-armed bandit (MAB) problem, and propose two reinforcement learning (RL) based adaptive reset policies using the Upper Confidence Bound (UCB) and Thompson sampling algorithms. These two algorithms balance the exploration-exploitation tradeoff by adaptively choosing arms (reset vs. no reset) based on their estimated rewards during the solver's run. We implement our reset policies in four baseline SOTA CDCL solvers and compare the baselines against the reset versions on Satcoin benchmarks and SAT Competition instances. Our results show that RL-based reset versions outperform the corresponding baseline solvers on both Satcoin and the SAT competition instances, suggesting that our RL policy helps to dynamically and profitably adapt the reset frequency for any given input instance. We also introduce the concept of a partial reset, where at least a constant number of variable activities are retained across reset boundaries. Building on previous results, we show that there is an exponential separation between O(1) vs. $\Omega(n)$-length partial resets. - [739] arXiv:2404.03775 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Systems Theoretic Approach to Online Machine LearningComments: Accepted by the 18th Annual IEEE International Systems Conference (SysCon)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: The machine learning formulation of online learning is incomplete from a systems theoretic perspective. Typically, machine learning research emphasizes domains and tasks, and a problem solving worldview. It focuses on algorithm parameters, features, and samples, and neglects the perspective offered by considering system structure and system behavior or dynamics. Online learning is an active field of research and has been widely explored in terms of statistical theory and computational algorithms, however, in general, the literature still lacks formal system theoretical frameworks for modeling online learning systems and resolving systems-related concept drift issues. Furthermore, while the machine learning formulation serves to classify methods and literature, the systems theoretic formulation presented herein serves to provide a framework for the top-down design of online learning systems, including a novel definition of online learning and the identification of key design parameters. The framework is formulated in terms of input-output systems and is further divided into system structure and system behavior. Concept drift is a critical challenge faced in online learning, and this work formally approaches it as part of the system behavior characteristics. Healthcare provider fraud detection using machine learning is used as a case study throughout the paper to ground the discussion in a real-world online learning challenge.
- [740] arXiv:2404.03784 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Layerwise Early Stopping for Test Time AdaptationSabyasachi Sahoo , Mostafa ElAraby , Jonas Ngnawe , Yann Pequignot , Frederic Precioso , Christian GagneComments: 14 pages, 5 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Test Time Adaptation (TTA) addresses the problem of distribution shift by enabling pretrained models to learn new features on an unseen domain at test time. However, it poses a significant challenge to maintain a balance between learning new features and retaining useful pretrained features. In this paper, we propose Layerwise EArly STopping (LEAST) for TTA to address this problem. The key idea is to stop adapting individual layers during TTA if the features being learned do not appear beneficial for the new domain. For that purpose, we propose using a novel gradient-based metric to measure the relevance of the current learnt features to the new domain without the need for supervised labels. More specifically, we propose to use this metric to determine dynamically when to stop updating each layer during TTA. This enables a more balanced adaptation, restricted to layers benefiting from it, and only for a certain number of steps. Such an approach also has the added effect of limiting the forgetting of pretrained features useful for dealing with new domains. Through extensive experiments, we demonstrate that Layerwise Early Stopping improves the performance of existing TTA approaches across multiple datasets, domain shifts, model architectures, and TTA losses.
- [741] arXiv:2404.03789 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Quantifying Uncertainty in Motion Prediction with Variational Bayesian MixtureComments: Accepted at CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Safety and robustness are crucial factors in developing trustworthy autonomous vehicles. One essential aspect of addressing these factors is to equip vehicles with the capability to predict future trajectories for all moving objects in the surroundings and quantify prediction uncertainties. In this paper, we propose the Sequential Neural Variational Agent (SeNeVA), a generative model that describes the distribution of future trajectories for a single moving object. Our approach can distinguish Out-of-Distribution data while quantifying uncertainty and achieving competitive performance compared to state-of-the-art methods on the Argoverse 2 and INTERACTION datasets. Specifically, a 0.446 meters minimum Final Displacement Error, a 0.203 meters minimum Average Displacement Error, and a 5.35% Miss Rate are achieved on the INTERACTION test set. Extensive qualitative and quantitative analysis is also provided to evaluate the proposed model. Our open-source code is available at this https URL .
- [742] arXiv:2404.03799 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Language-Guided Instance-Aware Domain-Adaptive Panoptic SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The increasing relevance of panoptic segmentation is tied to the advancements in autonomous driving and AR/VR applications. However, the deployment of such models has been limited due to the expensive nature of dense data annotation, giving rise to unsupervised domain adaptation (UDA). A key challenge in panoptic UDA is reducing the domain gap between a labeled source and an unlabeled target domain while harmonizing the subtasks of semantic and instance segmentation to limit catastrophic interference. While considerable progress has been achieved, existing approaches mainly focus on the adaptation of semantic segmentation. In this work, we focus on incorporating instance-level adaptation via a novel instance-aware cross-domain mixing strategy IMix. IMix significantly enhances the panoptic quality by improving instance segmentation performance. Specifically, we propose inserting high-confidence predicted instances from the target domain onto source images, retaining the exhaustiveness of the resulting pseudo-labels while reducing the injected confirmation bias. Nevertheless, such an enhancement comes at the cost of degraded semantic performance, attributed to catastrophic forgetting. To mitigate this issue, we regularize our semantic branch by employing CLIP-based domain alignment (CDA), exploiting the domain-robustness of natural language prompts. Finally, we present an end-to-end model incorporating these two mechanisms called LIDAPS, achieving state-of-the-art results on all popular panoptic UDA benchmarks.
- [743] arXiv:2404.03827 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Uniform Memory Retrieval with Larger Capacity for Modern Hopfield ModelsComments: 64 pages; Code available at this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: We propose a two-stage memory retrieval dynamics for modern Hopfield models, termed $\mathtt{U\text{-}Hop}$, with enhanced memory capacity. Our key contribution is a learnable feature map $\Phi$ which transforms the Hopfield energy function into a kernel space. This transformation ensures convergence between the local minima of energy and the fixed points of retrieval dynamics within the kernel space. Consequently, the kernel norm induced by $\Phi$ serves as a novel similarity measure. It utilizes the stored memory patterns as learning data to enhance memory capacity across all modern Hopfield models. Specifically, we accomplish this by constructing a separation loss $\mathcal{L}_\Phi$ that separates the local minima of kernelized energy by separating stored memory patterns in kernel space. Methodologically, $\mathtt{U\text{-}Hop}$ memory retrieval process consists of: \textbf{(Stage~I.)} minimizing separation loss for a more uniformed memory (local minimum) distribution, followed by \textbf{(Stage~II.)} standard Hopfield energy minimization for memory retrieval. This results in a significant reduction of possible meta-stable states in the Hopfield energy function, thus enhancing memory capacity by preventing memory confusion. Empirically, with real-world datasets, we demonstrate that $\mathtt{U\text{-}Hop}$ outperforms all existing modern Hopfield models and SOTA similarity measures, achieving substantial improvements in both associative memory retrieval and deep learning tasks.
- [744] arXiv:2404.03828 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Outlier-Efficient Hopfield Layers for Large Transformer-Based ModelsJerry Yao-Chieh Hu , Pei-Hsuan Chang , Robin Luo , Hong-Yu Chen , Weijian Li , Wei-Po Wang , Han LiuComments: 48 pages; Code available at this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: We introduce an Outlier-Efficient Modern Hopfield Model (termed $\mathtt{OutEffHop}$) and use it to address the outlier-induced challenge of quantizing gigantic transformer-based models. Our main contribution is a novel associative memory model facilitating \textit{outlier-efficient} associative memory retrievals. Interestingly, this memory model manifests a model-based interpretation of an outlier-efficient attention mechanism ($\text{Softmax}_1$): it is an approximation of the memory retrieval process of $\mathtt{OutEffHop}$. Methodologically, this allows us to debut novel outlier-efficient Hopfield layers a powerful attention alternative with superior post-quantization performance. Theoretically, the Outlier-Efficient Modern Hopfield Model retains and improves the desirable properties of the standard modern Hopfield models, including fixed point convergence and exponential storage capacity. Empirically, we demonstrate the proposed model's efficacy across large-scale transformer-based and Hopfield-based models (including BERT, OPT, ViT and STanHop-Net), benchmarking against state-of-the-art methods including $\mathtt{Clipped\_Softmax}$ and $\mathtt{Gated\_Attention}$. Notably, $\mathtt{OutEffHop}$ achieves on average $\sim$22+\% reductions in both average kurtosis and maximum infinity norm of model outputs accross 4 models.
- [745] arXiv:2404.03830 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: BiSHop: Bi-Directional Cellular Learning for Tabular Data with Generalized Sparse Modern Hopfield ModelChenwei Xu , Yu-Chao Huang , Jerry Yao-Chieh Hu , Weijian Li , Ammar Gilani , Hsi-Sheng Goan , Han LiuComments: 40 page; Code available at this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: We introduce the \textbf{B}i-Directional \textbf{S}parse \textbf{Hop}field Network (\textbf{BiSHop}), a novel end-to-end framework for deep tabular learning. BiSHop handles the two major challenges of deep tabular learning: non-rotationally invariant data structure and feature sparsity in tabular data. Our key motivation comes from the recent established connection between associative memory and attention mechanisms. Consequently, BiSHop uses a dual-component approach, sequentially processing data both column-wise and row-wise through two interconnected directional learning modules. Computationally, these modules house layers of generalized sparse modern Hopfield layers, a sparse extension of the modern Hopfield model with adaptable sparsity. Methodologically, BiSHop facilitates multi-scale representation learning, capturing both intra-feature and inter-feature interactions, with adaptive sparsity at each scale. Empirically, through experiments on diverse real-world datasets, we demonstrate that BiSHop surpasses current SOTA methods with significantly less HPO runs, marking it a robust solution for deep tabular learning.
- [746] arXiv:2404.03836 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal ModelComments: 14 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in 3D perception systems have significantly improved their ability to perform visual recognition tasks such as segmentation. However, these systems still heavily rely on explicit human instruction to identify target objects or categories, lacking the capability to actively reason and comprehend implicit user intentions. We introduce a novel segmentation task known as reasoning part segmentation for 3D objects, aiming to output a segmentation mask based on complex and implicit textual queries about specific parts of a 3D object. To facilitate evaluation and benchmarking, we present a large 3D dataset comprising over 60k instructions paired with corresponding ground-truth part segmentation annotations specifically curated for reasoning-based 3D part segmentation. We propose a model that is capable of segmenting parts of 3D objects based on implicit textual queries and generating natural language explanations corresponding to 3D object segmentation requests. Experiments show that our method achieves competitive performance to models that use explicit queries, with the additional abilities to identify part concepts, reason about them, and complement them with world knowledge. Our source code, dataset, and trained models are available at this https URL .
- [747] arXiv:2404.03838 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: A Block-Coordinate Descent EMO Algorithm: Theoretical and Empirical AnalysisComments: Accepted at GECCO 2024Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: We consider whether conditions exist under which block-coordinate descent is asymptotically efficient in evolutionary multi-objective optimization, addressing an open problem. Block-coordinate descent, where an optimization problem is decomposed into $k$ blocks of decision variables and each of the blocks is optimized (with the others fixed) in a sequence, is a technique used in some large-scale optimization problems such as airline scheduling, however its use in multi-objective optimization is less studied. We propose a block-coordinate version of GSEMO and compare its running time to the standard GSEMO algorithm. Theoretical and empirical results on a bi-objective test function, a variant of LOTZ, serve to demonstrate the existence of cases where block-coordinate descent is faster. The result may yield wider insights into this class of algorithms.
- [748] arXiv:2404.03868 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph ConstructionComments: 15 pages, 2 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In this work, we are interested in automated methods for knowledge graph creation (KGC) from input text. Progress on large language models (LLMs) has prompted a series of recent works applying them to KGC, e.g., via zero/few-shot prompting. Despite successes on small domain-specific datasets, these models face difficulties scaling up to text common in many real-world applications. A principal issue is that in prior methods, the KG schema has to be included in the LLM prompt to generate valid triplets; larger and more complex schema easily exceed the LLMs' context window length. To address this problem, we propose a three-phase framework named Extract-Define-Canonicalize (EDC): open information extraction followed by schema definition and post-hoc canonicalization. EDC is flexible in that it can be applied to settings where a pre-defined target schema is available and when it is not; in the latter case, it constructs a schema automatically and applies self-canonicalization. To further improve performance, we introduce a trained component that retrieves schema elements relevant to the input text; this improves the LLMs' extraction performance in a retrieval-augmented generation-like manner. We demonstrate on three KGC benchmarks that EDC is able to extract high-quality triplets without any parameter tuning and with significantly larger schemas compared to prior works.
- [749] arXiv:2404.03869 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Heterogeneous Multi-Agent Reinforcement Learning for Zero-Shot Scalable CollaborationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO); Systems and Control (eess.SY)
Abstract: The rise of multi-agent systems, especially the success of multi-agent reinforcement learning (MARL), is reshaping our future across diverse domains like autonomous vehicle networks. However, MARL still faces significant challenges, particularly in achieving zero-shot scalability, which allows trained MARL models to be directly applied to unseen tasks with varying numbers of agents. In addition, real-world multi-agent systems usually contain agents with different functions and strategies, while the existing scalable MARL methods only have limited heterogeneity. To address this, we propose a novel MARL framework named Scalable and Heterogeneous Proximal Policy Optimization (SHPPO), integrating heterogeneity into parameter-shared PPO-based MARL networks. we first leverage a latent network to adaptively learn strategy patterns for each agent. Second, we introduce a heterogeneous layer for decision-making, whose parameters are specifically generated by the learned latent variables. Our approach is scalable as all the parameters are shared except for the heterogeneous layer, and gains both inter-individual and temporal heterogeneity at the same time. We implement our approach based on the state-of-the-art backbone PPO-based algorithm as SHPPO, while our approach is agnostic to the backbone and can be seamlessly plugged into any parameter-shared MARL method. SHPPO exhibits superior performance over the baselines such as MAPPO and HAPPO in classic MARL environments like Starcraft Multi-Agent Challenge (SMAC) and Google Research Football (GRF), showcasing enhanced zero-shot scalability and offering insights into the learned latent representation's impact on team performance by visualization.
- [750] arXiv:2404.03887 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SAAS: Solving Ability Amplification Strategy for Enhanced Mathematical Reasoning in Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This study presents a novel learning approach designed to enhance both mathematical reasoning and problem-solving abilities of Large Language Models (LLMs). We focus on integrating the Chain-of-Thought (CoT) and the Program-of-Thought (PoT) learning, hypothesizing that prioritizing the learning of mathematical reasoning ability is helpful for the amplification of problem-solving ability. Thus, the initial learning with CoT is essential for solving challenging mathematical problems. To this end, we propose a sequential learning approach, named SAAS (Solving Ability Amplification Strategy), which strategically transitions from CoT learning to PoT learning. Our empirical study, involving an extensive performance comparison using several benchmarks, demonstrates that our SAAS achieves state-of-the-art (SOTA) performance. The results underscore the effectiveness of our sequential learning approach, marking a significant advancement in the field of mathematical reasoning in LLMs.
- [751] arXiv:2404.03888 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A proximal policy optimization based intelligent home solar managementComments: This manuscript has been accepted for IEEE EIT ConferenceSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In the smart grid, the prosumers can sell unused electricity back to the power grid, assuming the prosumers own renewable energy sources and storage units. The maximizing of their profits under a dynamic electricity market is a problem that requires intelligent planning. To address this, we propose a framework based on Proximal Policy Optimization (PPO) using recurrent rewards. By using the information about the rewards modeled effectively with PPO to maximize our objective, we were able to get over 30\% improvement over the other naive algorithms in accumulating total profits. This shows promise in getting reinforcement learning algorithms to perform tasks required to plan their actions in complex domains like financial markets. We also introduce a novel method for embedding longs based on soliton waves that outperformed normal embedding in our use case with random floating point data augmentation.
- [752] arXiv:2404.03891 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Can only LLMs do Reasoning?: Potential of Small Language Models in Task PlanningComments: 8 pages, 11 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In robotics, the use of Large Language Models (LLMs) is becoming prevalent, especially for understanding human commands. In particular, LLMs are utilized as domain-agnostic task planners for high-level human commands. LLMs are capable of Chain-of-Thought (CoT) reasoning, and this allows LLMs to be task planners. However, we need to consider that modern robots still struggle to perform complex actions, and the domains where robots can be deployed are limited in practice. This leads us to pose a question: If small LMs can be trained to reason in chains within a single domain, would even small LMs be good task planners for the robots? To train smaller LMs to reason in chains, we build `COmmand-STeps datasets' (COST) consisting of high-level commands along with corresponding actionable low-level steps, via LLMs. We release not only our datasets but also the prompt templates used to generate them, to allow anyone to build datasets for their domain. We compare GPT3.5 and GPT4 with the finetuned GPT2 for task domains, in tabletop and kitchen environments, and the result shows that GPT2-medium is comparable to GPT3.5 for task planning in a specific domain. Our dataset, code, and more output samples can be found in this https URL
- [753] arXiv:2404.03892 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AISubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract: The Deep learning (DL) models for diagnosing breast cancer from mammographic images often operate as "black boxes", making it difficult for healthcare professionals to trust and understand their decision-making processes. The study presents an integrated framework combining Convolutional Neural Networks (CNNs) and Explainable Artificial Intelligence (XAI) for the enhanced diagnosis of breast cancer using the CBIS-DDSM dataset. The methodology encompasses an elaborate data preprocessing pipeline and advanced data augmentation techniques to counteract dataset limitations and transfer learning using pre-trained networks such as VGG-16, Inception-V3 and ResNet was employed. A focal point of our study is the evaluation of XAI's effectiveness in interpreting model predictions, highlighted by utilizing the Hausdorff measure to assess the alignment between AI-generated explanations and expert annotations quantitatively. This approach is critical for XAI in promoting trustworthiness and ethical fairness in AI-assisted diagnostics. The findings from our research illustrate the effective collaboration between CNNs and XAI in advancing diagnostic methods for breast cancer, thereby facilitating a more seamless integration of advanced AI technologies within clinical settings. By enhancing the interpretability of AI driven decisions, this work lays the groundwork for improved collaboration between AI systems and medical practitioners, ultimately enriching patient care. Furthermore, the implications of our research extended well beyond the current methodologies. It encourages further research into how to combine multimodal data and improve AI explanations to meet the needs of clinical practice.
- [754] arXiv:2404.03894 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Holon: a cybernetic interface for bio-semioticsComments: Paper accepted at ISEA 24, The 29th International Symposium on Electronic Art, Brisbane, Australia, 21-29 June 2024Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Audio and Speech Processing (eess.AS)
Abstract: This paper presents an interactive artwork, "Holon", a collection of 130 autonomous, cybernetic organisms that listen and make sound in collaboration with the natural environment. The work was developed for installation on water at a heritage-listed dock in Melbourne, Australia. Conceptual issues informing the work are presented, along with a detailed technical overview of the implementation. Individual holons are of three types, inspired by biological models of animal communication: composer/generators, collector/critics and disruptors. Collectively, Holon integrates and occupies elements of the acoustic spectrum in collaboration with human and non-human agents.
- [755] arXiv:2404.03900 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: Nonparametric Modern Hopfield ModelsComments: 59 pages; Code available at this https URLSubjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Abstract: We present a nonparametric construction for deep learning compatible modern Hopfield models and utilize this framework to debut an efficient variant. Our key contribution stems from interpreting the memory storage and retrieval processes in modern Hopfield models as a nonparametric regression problem subject to a set of query-memory pairs. Crucially, our framework not only recovers the known results from the original dense modern Hopfield model but also fills the void in the literature regarding efficient modern Hopfield models, by introducing \textit{sparse-structured} modern Hopfield models with sub-quadratic complexity. We establish that this sparse model inherits the appealing theoretical properties of its dense analogue -- connection with transformer attention, fixed point convergence and exponential memory capacity -- even without knowing details of the Hopfield energy function. Additionally, we showcase the versatility of our framework by constructing a family of modern Hopfield models as extensions, including linear, random masked, top-$K$ and positive random feature modern Hopfield models. Empirically, we validate the efficacy of our framework in both synthetic and realistic settings.
- [756] arXiv:2404.03908 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Multi-Task Learning for Lung sound & Lung disease classificationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Sound (cs.SD)
Abstract: In recent years, advancements in deep learning techniques have considerably enhanced the efficiency and accuracy of medical diagnostics. In this work, a novel approach using multi-task learning (MTL) for the simultaneous classification of lung sounds and lung diseases is proposed. Our proposed model leverages MTL with four different deep learning models such as 2D CNN, ResNet50, MobileNet and Densenet to extract relevant features from the lung sound recordings. The ICBHI 2017 Respiratory Sound Database was employed in the current study. The MTL for MobileNet model performed better than the other models considered, with an accuracy of74\% for lung sound analysis and 91\% for lung diseases classification. Results of the experimentation demonstrate the efficacy of our approach in classifying both lung sounds and lung diseases concurrently.
In this study,using the demographic data of the patients from the database, risk level computation for Chronic Obstructive Pulmonary Disease is also carried out. For this computation, three machine learning algorithms namely Logistic Regression, SVM and Random Forest classifierswere employed. Among these ML algorithms, the Random Forest classifier had the highest accuracy of 92\%.This work helps in considerably reducing the physician's burden of not just diagnosing the pathology but also effectively communicating to the patient about the possible causes or outcomes. - [757] arXiv:2404.03912 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Forget NLI, Use a Dictionary: Zero-Shot Topic Classification for Low-Resource Languages with Application to LuxembourgishComments: 3rd Annual Meeting of the ELRA/ISCA Special Interest Group on Under-resourced Languages (SIGUL 2024)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In NLP, zero-shot classification (ZSC) is the task of assigning labels to textual data without any labeled examples for the target classes. A common method for ZSC is to fine-tune a language model on a Natural Language Inference (NLI) dataset and then use it to infer the entailment between the input document and the target labels. However, this approach faces certain challenges, particularly for languages with limited resources. In this paper, we propose an alternative solution that leverages dictionaries as a source of data for ZSC. We focus on Luxembourgish, a low-resource language spoken in Luxembourg, and construct two new topic relevance classification datasets based on a dictionary that provides various synonyms, word translations and example sentences. We evaluate the usability of our dataset and compare it with the NLI-based approach on two topic classification tasks in a zero-shot manner. Our results show that by using the dictionary-based dataset, the trained models outperform the ones following the NLI-based approach for ZSC. While we focus on a single low-resource language in this study, we believe that the efficacy of our approach can also transfer to other languages where such a dictionary is available.
- [758] arXiv:2404.03913 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image ModelsComments: CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: While there has been significant progress in customizing text-to-image generation models, generating images that combine multiple personalized concepts remains challenging. In this work, we introduce Concept Weaver, a method for composing customized text-to-image diffusion models at inference time. Specifically, the method breaks the process into two steps: creating a template image aligned with the semantics of input prompts, and then personalizing the template using a concept fusion strategy. The fusion strategy incorporates the appearance of the target concepts into the template image while retaining its structural details. The results indicate that our method can generate multiple custom concepts with higher identity fidelity compared to alternative approaches. Furthermore, the method is shown to seamlessly handle more than two concepts and closely follow the semantic meaning of the input prompt without blending appearances across different subjects.
- [759] arXiv:2404.03992 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Rolling the dice for better deep learning performance: A study of randomness techniques in deep neural networksMohammed Ghaith Altarabichi , Sławomir Nowaczyk , Sepideh Pashami , Peyman Sheikholharam Mashhadi , Julia HandlJournal-ref: Information Sciences, p.120500 (2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Abstract: This paper investigates how various randomization techniques impact Deep Neural Networks (DNNs). Randomization, like weight noise and dropout, aids in reducing overfitting and enhancing generalization, but their interactions are poorly understood. The study categorizes randomness techniques into four types and proposes new methods: adding noise to the loss function and random masking of gradient updates. Using Particle Swarm Optimizer (PSO) for hyperparameter optimization, it explores optimal configurations across MNIST, FASHION-MNIST, CIFAR10, and CIFAR100 datasets. Over 30,000 configurations are evaluated, revealing data augmentation and weight initialization randomness as main performance contributors. Correlation analysis shows different optimizers prefer distinct randomization types. The complete implementation and dataset are available on GitHub.
- [760] arXiv:2404.03995 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Balancing Progress and Responsibility: A Synthesis of Sustainability Trade-Offs of AI-Based SystemsComments: Accepted for publication at the 8th International Workshop on Green and Sustainable Software (GREENS'24), collocated with ICSA'24Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Recent advances in artificial intelligence (AI) capabilities have increased the eagerness of companies to integrate AI into software systems. While AI can be used to have a positive impact on several dimensions of sustainability, this is often overshadowed by its potential negative influence. While many studies have explored sustainability factors in isolation, there is insufficient holistic coverage of potential sustainability benefits or costs that practitioners need to consider during decision-making for AI adoption. We therefore aim to synthesize trade-offs related to sustainability in the context of integrating AI into software systems. We want to make the sustainability benefits and costs of integrating AI more transparent and accessible for practitioners.
The study was conducted in collaboration with a Dutch financial organization. We first performed a rapid review that led to the inclusion of 151 research papers. Afterward, we conducted six semi-structured interviews to enrich the data with industry perspectives. The combined results showcase the potential sustainability benefits and costs of integrating AI. The labels synthesized from the review regarding potential sustainability benefits were clustered into 16 themes, with "energy management" being the most frequently mentioned one. 11 themes were identified in the interviews, with the top mentioned theme being "employee wellbeing". Regarding sustainability costs, the review discovered seven themes, with "deployment issues" being the most popular one, followed by "ethics & society". "Environmental issues" was the top theme from the interviews. Our results provide valuable insights to organizations and practitioners for understanding the potential sustainability implications of adopting AI. - [761] arXiv:2404.03996 (cross-list from cs.NE) [ pdf , ps , other ]
-
Title: Fast Genetic Algorithm for feature selection -- A qualitative approximation approachJournal-ref: Expert Systems with Applications, 211, p.118528 (2023)Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Evolutionary Algorithms (EAs) are often challenging to apply in real-world settings since evolutionary computations involve a large number of evaluations of a typically expensive fitness function. For example, an evaluation could involve training a new machine learning model. An approximation (also known as meta-model or a surrogate) of the true function can be used in such applications to alleviate the computation cost. In this paper, we propose a two-stage surrogate-assisted evolutionary approach to address the computational issues arising from using Genetic Algorithm (GA) for feature selection in a wrapper setting for large datasets. We define 'Approximation Usefulness' to capture the necessary conditions to ensure correctness of the EA computations when an approximation is used. Based on this definition, we propose a procedure to construct a lightweight qualitative meta-model by the active selection of data instances. We then use a meta-model to carry out the feature selection task. We apply this procedure to the GA-based algorithm CHC (Cross generational elitist selection, Heterogeneous recombination and Cataclysmic mutation) to create a Qualitative approXimations variant, CHCQX. We show that CHCQX converges faster to feature subset solutions of significantly higher accuracy (as compared to CHC), particularly for large datasets with over 100K instances. We also demonstrate the applicability of the thinking behind our approach more broadly to Swarm Intelligence (SI), another branch of the Evolutionary Computation (EC) paradigm with results of PSOQX, a qualitative approximation adaptation of the Particle Swarm Optimization (PSO) method. A GitHub repository with the complete implementation is available.
- [762] arXiv:2404.03997 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Demonstration Guided Multi-Objective Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Multi-objective reinforcement learning (MORL) is increasingly relevant due to its resemblance to real-world scenarios requiring trade-offs between multiple objectives. Catering to diverse user preferences, traditional reinforcement learning faces amplified challenges in MORL. To address the difficulty of training policies from scratch in MORL, we introduce demonstration-guided multi-objective reinforcement learning (DG-MORL). This novel approach utilizes prior demonstrations, aligns them with user preferences via corner weight support, and incorporates a self-evolving mechanism to refine suboptimal demonstrations. Our empirical studies demonstrate DG-MORL's superiority over existing MORL algorithms, establishing its robustness and efficacy, particularly under challenging conditions. We also provide an upper bound of the algorithm's sample complexity.
- [763] arXiv:2404.04001 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Approximate UMAP allows for high-rate online visualization of high-dimensional data streamsComments: 6 pages, 3 figures, submitted to the Graz BCI conference 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Abstract: In the BCI field, introspection and interpretation of brain signals are desired for providing feedback or to guide rapid paradigm prototyping but are challenging due to the high noise level and dimensionality of the signals. Deep neural networks are often introspected by transforming their learned feature representations into 2- or 3-dimensional subspace visualizations using projection algorithms like Uniform Manifold Approximation and Projection (UMAP). Unfortunately, these methods are computationally expensive, making the projection of data streams in real-time a non-trivial task. In this study, we introduce a novel variant of UMAP, called approximate UMAP (aUMAP). It aims at generating rapid projections for real-time introspection. To study its suitability for real-time projecting, we benchmark the methods against standard UMAP and its neural network counterpart parametric UMAP. Our results show that approximate UMAP delivers projections that replicate the projection space of standard UMAP while decreasing projection speed by an order of magnitude and maintaining the same training time.
- [764] arXiv:2404.04057 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step GenerationComments: ICML 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Abstract: We introduce Score identity Distillation (SiD), an innovative data-free method that distills the generative capabilities of pretrained diffusion models into a single-step generator. SiD not only facilitates an exponentially fast reduction in Fréchet inception distance (FID) during distillation but also approaches or even exceeds the FID performance of the original teacher diffusion models. By reformulating forward diffusion processes as semi-implicit distributions, we leverage three score-related identities to create an innovative loss mechanism. This mechanism achieves rapid FID reduction by training the generator using its own synthesized images, eliminating the need for real data or reverse-diffusion-based generation, all accomplished within significantly shortened generation time. Upon evaluation across four benchmark datasets, the SiD algorithm demonstrates high iteration efficiency during distillation and surpasses competing distillation approaches, whether they are one-step or few-step, data-free, or dependent on training data, in terms of generation quality. This achievement not only redefines the benchmarks for efficiency and effectiveness in diffusion distillation but also in the broader field of diffusion-based generation. The PyTorch implementation is available at this https URL
- [765] arXiv:2404.04067 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CLUE: A Clinical Language Understanding Evaluation for LLMsAmin Dada , Marie Bauer , Amanda Butler Contreras , Osman Alperen Koraş , Constantin Marc Seibold , Kaleb E Smith , Jens KleesiekSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs) have shown the potential to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs address healthcare-specific challenges, including privacy demands and computational constraints. However, evaluation of these models has primarily been limited to non-clinical tasks, which do not reflect the complexity of practical clinical applications. Additionally, there has been no thorough comparison between biomedical and general-domain LLMs for clinical tasks. To fill this gap, we present the Clinical Language Understanding Evaluation (CLUE), a benchmark tailored to evaluate LLMs on real-world clinical tasks. CLUE includes two novel datasets derived from MIMIC IV discharge letters and four existing tasks designed to test the practical applicability of LLMs in healthcare settings. Our evaluation covers several biomedical and general domain LLMs, providing insights into their clinical performance and applicability. CLUE represents a step towards a standardized approach to evaluating and developing LLMs in healthcare to align future model development with the real-world needs of clinical application. We publish our evaluation and data generation scripts: this https URL .
- [766] arXiv:2404.04095 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Dynamic Prompt Optimizing for Text-to-Image GenerationComments: Accepted to CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Text-to-image generative models, specifically those based on diffusion models like Imagen and Stable Diffusion, have made substantial advancements. Recently, there has been a surge of interest in the delicate refinement of text prompts. Users assign weights or alter the injection time steps of certain words in the text prompts to improve the quality of generated images. However, the success of fine-control prompts depends on the accuracy of the text prompts and the careful selection of weights and time steps, which requires significant manual intervention. To address this, we introduce the \textbf{P}rompt \textbf{A}uto-\textbf{E}diting (PAE) method. Besides refining the original prompts for image generation, we further employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to the dynamic fine-control prompts. The reward function during training encourages the model to consider aesthetic score, semantic consistency, and user preferences. Experimental results demonstrate that our proposed method effectively improves the original prompts, generating visually more appealing images while maintaining semantic alignment. Code is available at this https URL .
- [767] arXiv:2404.04102 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Robust Preference Optimization with Provable Noise Tolerance for LLMsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The preference alignment aims to enable large language models (LLMs) to generate responses that conform to human values, which is essential for developing general AI systems. Ranking-based methods -- a promising class of alignment approaches -- learn human preferences from datasets containing response pairs by optimizing the log-likelihood margins between preferred and dis-preferred responses. However, due to the inherent differences in annotators' preferences, ranking labels of comparisons for response pairs are unavoidably noisy. This seriously hurts the reliability of existing ranking-based methods. To address this problem, we propose a provably noise-tolerant preference alignment method, namely RObust Preference Optimization (ROPO). To the best of our knowledge, ROPO is the first preference alignment method with noise-tolerance guarantees. The key idea of ROPO is to dynamically assign conservative gradient weights to response pairs with high label uncertainty, based on the log-likelihood margins between the responses. By effectively suppressing the gradients of noisy samples, our weighting strategy ensures that the expected risk has the same gradient direction independent of the presence and proportion of noise. Experiments on three open-ended text generation tasks with four base models ranging in size from 2.8B to 13B demonstrate that ROPO significantly outperforms existing ranking-based methods.
- [768] arXiv:2404.04139 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Precision Guided Approach to Mitigate Data Poisoning Attacks in Federated LearningComments: 14 pages, 11 figures, 5 tables, Accepted in ACM CODASPY 2024Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Federated Learning (FL) is a collaborative learning paradigm enabling participants to collectively train a shared machine learning model while preserving the privacy of their sensitive data. Nevertheless, the inherent decentralized and data-opaque characteristics of FL render its susceptibility to data poisoning attacks. These attacks introduce malformed or malicious inputs during local model training, subsequently influencing the global model and resulting in erroneous predictions. Current FL defense strategies against data poisoning attacks either involve a trade-off between accuracy and robustness or necessitate the presence of a uniformly distributed root dataset at the server. To overcome these limitations, we present FedZZ, which harnesses a zone-based deviating update (ZBDU) mechanism to effectively counter data poisoning attacks in FL. Further, we introduce a precision-guided methodology that actively characterizes these client clusters (zones), which in turn aids in recognizing and discarding malicious updates at the server. Our evaluation of FedZZ across two widely recognized datasets: CIFAR10 and EMNIST, demonstrate its efficacy in mitigating data poisoning attacks, surpassing the performance of prevailing state-of-the-art methodologies in both single and multi-client attack scenarios and varying attack volumes. Notably, FedZZ also functions as a robust client selection strategy, even in highly non-IID and attack-free scenarios. Moreover, in the face of escalating poisoning rates, the model accuracy attained by FedZZ displays superior resilience compared to existing techniques. For instance, when confronted with a 50% presence of malicious clients, FedZZ sustains an accuracy of 67.43%, while the accuracy of the second-best solution, FL-Defender, diminishes to 43.36%.
- [769] arXiv:2404.04159 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Noisy Label Processing for Classification: A SurveySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In recent years, deep neural networks (DNNs) have gained remarkable achievement in computer vision tasks, and the success of DNNs often depends greatly on the richness of data. However, the acquisition process of data and high-quality ground truth requires a lot of manpower and money. In the long, tedious process of data annotation, annotators are prone to make mistakes, resulting in incorrect labels of images, i.e., noisy labels. The emergence of noisy labels is inevitable. Moreover, since research shows that DNNs can easily fit noisy labels, the existence of noisy labels will cause significant damage to the model training process. Therefore, it is crucial to combat noisy labels for computer vision tasks, especially for classification tasks. In this survey, we first comprehensively review the evolution of different deep learning approaches for noisy label combating in the image classification task. In addition, we also review different noise patterns that have been proposed to design robust algorithms. Furthermore, we explore the inner pattern of real-world label noise and propose an algorithm to generate a synthetic label noise pattern guided by real-world data. We test the algorithm on the well-known real-world dataset CIFAR-10N to form a new real-world data-guided synthetic benchmark and evaluate some typical noise-robust methods on the benchmark.
- [770] arXiv:2404.04167 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language ModelXinrun Du , Zhouliang Yu , Songyang Gao , Ding Pan , Yuyang Cheng , Ziyang Ma , Ruibin Yuan , Xingwei Qu , Jiaheng Liu , Tianyu Zheng , Xinchen Luo , Guorui Zhou , Binhang Yuan , Wenhu Chen , Jie Fu , Ge ZhangSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure with the obtained Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models.
- [771] arXiv:2404.04205 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Enhancing IoT Intelligence: A Transformer-based Reinforcement Learning MethodologySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The proliferation of the Internet of Things (IoT) has led to an explosion of data generated by interconnected devices, presenting both opportunities and challenges for intelligent decision-making in complex environments. Traditional Reinforcement Learning (RL) approaches often struggle to fully harness this data due to their limited ability to process and interpret the intricate patterns and dependencies inherent in IoT applications. This paper introduces a novel framework that integrates transformer architectures with Proximal Policy Optimization (PPO) to address these challenges. By leveraging the self-attention mechanism of transformers, our approach enhances RL agents' capacity for understanding and acting within dynamic IoT environments, leading to improved decision-making processes. We demonstrate the effectiveness of our method across various IoT scenarios, from smart home automation to industrial control systems, showing marked improvements in decision-making efficiency and adaptability. Our contributions include a detailed exploration of the transformer's role in processing heterogeneous IoT data, a comprehensive evaluation of the framework's performance in diverse environments, and a benchmark against traditional RL methods. The results indicate significant advancements in enabling RL agents to navigate the complexities of IoT ecosystems, highlighting the potential of our approach to revolutionize intelligent automation and decision-making in the IoT landscape.
- [772] arXiv:2404.04219 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Continual Policy Distillation of Reinforcement Learning-based Controllers for Soft Robotic In-Hand ManipulationComments: Accepted for presentation at IEEE RoboSoft 2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Dexterous manipulation, often facilitated by multi-fingered robotic hands, holds solid impact for real-world applications. Soft robotic hands, due to their compliant nature, offer flexibility and adaptability during object grasping and manipulation. Yet, benefits come with challenges, particularly in the control development for finger coordination. Reinforcement Learning (RL) can be employed to train object-specific in-hand manipulation policies, but limiting adaptability and generalizability. We introduce a Continual Policy Distillation (CPD) framework to acquire a versatile controller for in-hand manipulation, to rotate different objects in shape and size within a four-fingered soft gripper. The framework leverages Policy Distillation (PD) to transfer knowledge from expert policies to a continually evolving student policy network. Exemplar-based rehearsal methods are then integrated to mitigate catastrophic forgetting and enhance generalization. The performance of the CPD framework over various replay strategies demonstrates its effectiveness in consolidating knowledge from multiple experts and achieving versatile and adaptive behaviours for in-hand manipulation tasks.
- [773] arXiv:2404.04220 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Multi-modal perception for soft robotic interactions using generative modelsComments: Accepted for presentation at IEEE RoboSoft 2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Perception is essential for the active interaction of physical agents with the external environment. The integration of multiple sensory modalities, such as touch and vision, enhances this perceptual process, creating a more comprehensive and robust understanding of the world. Such fusion is particularly useful for highly deformable bodies such as soft robots. Developing a compact, yet comprehensive state representation from multi-sensory inputs can pave the way for the development of complex control strategies. This paper introduces a perception model that harmonizes data from diverse modalities to build a holistic state representation and assimilate essential information. The model relies on the causality between sensory input and robotic actions, employing a generative model to efficiently compress fused information and predict the next observation. We present, for the first time, a study on how touch can be predicted from vision and proprioception on soft robots, the importance of the cross-modal generation and why this is essential for soft robotic interactions in unstructured environments.
- [774] arXiv:2404.04234 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: player2vec: A Language Modeling Approach to Understand Player Behavior in GamesTianze Wang , Maryam Honari-Jahromi , Styliani Katsarou , Olga Mikheeva , Theodoros Panagiotakopoulos , Sahar Asadi , Oleg SmirnovSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Methods for learning latent user representations from historical behavior logs have gained traction for recommendation tasks in e-commerce, content streaming, and other settings. However, this area still remains relatively underexplored in video and mobile gaming contexts. In this work, we present a novel method for overcoming this limitation by extending a long-range Transformer model from the natural language processing domain to player behavior data. We discuss specifics of behavior tracking in games and propose preprocessing and tokenization approaches by viewing in-game events in an analogous way to words in sentences, thus enabling learning player representations in a self-supervised manner in the absence of ground-truth annotations. We experimentally demonstrate the efficacy of the proposed approach in fitting the distribution of behavior events by evaluating intrinsic language modeling metrics. Furthermore, we qualitatively analyze the emerging structure of the learned embedding space and show its value for generating insights into behavior patterns to inform downstream applications.
- [775] arXiv:2404.04242 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Physical Property Understanding from Language-Embedded Feature FieldsAlbert J. Zhai , Yuan Shen , Emily Y. Chen , Gloria X. Wang , Xinlei Wang , Sheng Wang , Kaiyu Guan , Shenlong WangComments: CVPR 2024. Project page (with code): this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Can computers perceive the physical properties of objects solely through vision? Research in cognitive science and vision science has shown that humans excel at identifying materials and estimating their physical properties based purely on visual appearance. In this paper, we present a novel approach for dense prediction of the physical properties of objects using a collection of images. Inspired by how humans reason about physics through vision, we leverage large language models to propose candidate materials for each object. We then construct a language-embedded point cloud and estimate the physical properties of each 3D point using a zero-shot kernel regression approach. Our method is accurate, annotation-free, and applicable to any object in the open world. Experiments demonstrate the effectiveness of the proposed approach in various physical property reasoning tasks, such as estimating the mass of common objects, as well as other properties like friction and hardness.
- [776] arXiv:2404.04243 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Identity Decoupling for Multi-Subject Personalization of Text-to-Image ModelsComments: Preprint. Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Text-to-image diffusion models have shown remarkable success in generating a personalized subject based on a few reference images. However, current methods struggle with handling multiple subjects simultaneously, often resulting in mixed identities with combined attributes from different subjects. In this work, we present MuDI, a novel framework that enables multi-subject personalization by effectively decoupling identities from multiple subjects. Our main idea is to utilize segmented subjects generated by the Segment Anything Model for both training and inference, as a form of data augmentation for training and initialization for the generation process. Our experiments demonstrate that MuDI can produce high-quality personalized images without identity mixing, even for highly similar subjects as shown in Figure 1. In human evaluation, MuDI shows twice as many successes for personalizing multiple subjects without identity mixing over existing baselines and is preferred over 70% compared to the strongest baseline. More results are available at this https URL .
- [777] arXiv:2404.04251 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)Comments: 15 pages main, 9 pages appendices, 16 figures, 3 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness-the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs). However, these metrics are not rigorously compared and benchmarked, instead presented against few weak baselines by correlation to human Likert scores over a set of easy-to-discriminate images.
We introduce T2IScoreScore (TS2), a curated set of semantic error graphs containing a prompt and a set increasingly erroneous images. These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests. Surprisingly, we find that the state-of-the-art VLM-based metrics (e.g., TIFA, DSG, LLMScore, VIEScore) we tested fail to significantly outperform simple feature-based metrics like CLIPScore, particularly on a hard subset of naturally-occurring T2I model errors. TS2 will enable the development of better T2I prompt faithfulness metrics through more rigorous comparison of their conformity to expected orderings and separations under objective criteria. - [778] arXiv:2404.04253 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Growing Q-Networks: Solving Continuous Control Tasks with Adaptive Control ResolutionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Recent reinforcement learning approaches have shown surprisingly strong capabilities of bang-bang policies for solving continuous control benchmarks. The underlying coarse action space discretizations often yield favourable exploration characteristics while final performance does not visibly suffer in the absence of action penalization in line with optimal control theory. In robotics applications, smooth control signals are commonly preferred to reduce system wear and energy efficiency, but action costs can be detrimental to exploration during early training. In this work, we aim to bridge this performance gap by growing discrete action spaces from coarse to fine control resolution, taking advantage of recent results in decoupled Q-learning to scale our approach to high-dimensional action spaces up to dim(A) = 38. Our work indicates that an adaptive control resolution in combination with value decomposition yields simple critic-only algorithms that yield surprisingly strong performance on continuous control tasks.
- [779] arXiv:2404.04254 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Watermark-based Detection and Attribution of AI-Generated ContentSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Several companies--such as Google, Microsoft, and OpenAI--have deployed techniques to watermark AI-generated content to enable proactive detection. However, existing literature mainly focuses on user-agnostic detection. Attribution aims to further trace back the user of a generative-AI service who generated a given content detected as AI-generated. Despite its growing importance, attribution is largely unexplored. In this work, we aim to bridge this gap by providing the first systematic study on watermark-based, user-aware detection and attribution of AI-generated content. Specifically, we theoretically study the detection and attribution performance via rigorous probabilistic analysis. Moreover, we develop an efficient algorithm to select watermarks for the users to enhance attribution performance. Both our theoretical and empirical results show that watermark-based detection and attribution inherit the accuracy and (non-)robustness properties of the watermarking method.
- [780] arXiv:2404.04261 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Novel BERT-based Classifier to Detect Political Leaning of YouTube Videos based on their TitlesComments: 14 pages, 4 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: A quarter of US adults regularly get their news from YouTube. Yet, despite the massive political content available on the platform, to date no classifier has been proposed to identify the political leaning of YouTube videos. To fill this gap, we propose a novel classifier based on Bert -- a language model from Google -- to classify YouTube videos merely based on their titles into six categories, namely: Far Left, Left, Center, Anti-Woke, Right, and Far Right. We used a public dataset of 10 million YouTube video titles (under various categories) to train and validate the proposed classifier. We compare the classifier against several alternatives that we trained on the same dataset, revealing that our classifier achieves the highest accuracy (75%) and the highest F1 score (77%). To further validate the classification performance, we collect videos from YouTube channels of numerous prominent news agencies, such as Fox News and New York Times, which have widely known political leanings, and apply our classifier to their video titles. For the vast majority of cases, the predicted political leaning matches that of the news agency.
- [781] arXiv:2404.04264 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Logic Query of Thoughts: Guiding Large Language Models to Answer Complex Logic Queries with Knowledge GraphsLihui Liu , Zihao Wang , Ruizhong Qiu , Yikun Ban , Eunice Chan , Yangqiu Song , Jingrui He , Hanghang TongSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Despite the superb performance in many tasks, large language models (LLMs) bear the risk of generating hallucination or even wrong answers when confronted with tasks that demand the accuracy of knowledge. The issue becomes even more noticeable when addressing logic queries that require multiple logic reasoning steps. On the other hand, knowledge graph (KG) based question answering methods are capable of accurately identifying the correct answers with the help of knowledge graph, yet its accuracy could quickly deteriorate when the knowledge graph itself is sparse and incomplete. It remains a critical challenge on how to integrate knowledge graph reasoning with LLMs in a mutually beneficial way so as to mitigate both the hallucination problem of LLMs as well as the incompleteness issue of knowledge graphs. In this paper, we propose 'Logic-Query-of-Thoughts' (LGOT) which is the first of its kind to combine LLMs with knowledge graph based logic query reasoning. LGOT seamlessly combines knowledge graph reasoning and LLMs, effectively breaking down complex logic queries into easy to answer subquestions. Through the utilization of both knowledge graph reasoning and LLMs, it successfully derives answers for each subquestion. By aggregating these results and selecting the highest quality candidate answers for each step, LGOT achieves accurate results to complex questions. Our experimental findings demonstrate substantial performance enhancements, with up to 20% improvement over ChatGPT.
- [782] arXiv:2404.04268 (cross-list from cs.IR) [ pdf , ps , other ]
-
Title: The Use of Generative Search Engines for Knowledge Work and Complex TasksSiddharth Suri , Scott Counts , Leijie Wang , Chacha Chen , Mengting Wan , Tara Safavi , Jennifer Neville , Chirag Shah , Ryen W. White , Reid Andersen , Georg Buscher , Sathish Manivannan , Nagu Rangan , Longqi YangComments: 32 pages, 3 figures, 4 tablesSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Abstract: Until recently, search engines were the predominant method for people to access online information. The recent emergence of large language models (LLMs) has given machines new capabilities such as the ability to generate new digital artifacts like text, images, code etc., resulting in a new tool, a generative search engine, which combines the capabilities of LLMs with a traditional search engine. Through the empirical analysis of Bing Copilot (Bing Chat), one of the first publicly available generative search engines, we analyze the types and complexity of tasks that people use Bing Copilot for compared to Bing Search. Findings indicate that people use the generative search engine for more knowledge work tasks that are higher in cognitive complexity than were commonly done with a traditional search engine.
- [783] arXiv:2404.04271 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Towards Effective Next POI Prediction: Spatial and Semantic Augmentation with Remote Sensing DataComments: 12 pages, 11 figures, Accepted by ICDE 2024Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Databases (cs.DB)
Abstract: The next point-of-interest (POI) prediction is a significant task in location-based services, yet its complexity arises from the consolidation of spatial and semantic intent. This fusion is subject to the influences of historical preferences, prevailing location, and environmental factors, thereby posing significant challenges. In addition, the uneven POI distribution further complicates the next POI prediction procedure. To address these challenges, we enrich input features and propose an effective deep-learning method within a two-step prediction framework. Our method first incorporates remote sensing data, capturing pivotal environmental context to enhance input features regarding both location and semantics. Subsequently, we employ a region quad-tree structure to integrate urban remote sensing, road network, and POI distribution spaces, aiming to devise a more coherent graph representation method for urban spatial. Leveraging this method, we construct the QR-P graph for the user's historical trajectories to encapsulate historical travel knowledge, thereby augmenting input features with comprehensive spatial and semantic insights. We devise distinct embedding modules to encode these features and employ an attention mechanism to fuse diverse encodings. In the two-step prediction procedure, we initially identify potential spatial zones by predicting user-preferred tiles, followed by pinpointing specific POIs of a designated type within the projected tiles. Empirical findings from four real-world location-based social network datasets underscore the remarkable superiority of our proposed approach over competitive baseline methods.
- [784] arXiv:2404.04279 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: When Abel Kills Cain: What Machine Translation Cannot CaptureComments: in French languageJournal-ref: Ce qui \'echappe \'a l'Intelligence Artificielle, Hermann, pp.111-129, 2024, 9791037038449Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The article aims at identifying what, from a structural point of view, AI based automatic translators cannot fully capture. It focuses on the machine's mistakes, in order to try to explain its causes. The biblical story of Caïn and Abel has been chosen because of its rich interpretive and critical tradition, but also because of its semantic difficulty. The investigation begins with the observation, for the translation of this text, of the language pairs and interfaces offered by the best known machine translation services (Google Translate, DeepL). A typology of the most frequent translation errors is then established. Finally, contemporary translations are compared, in order to underline the unique contribution of each. In conclusion, the article suggests a revision of translation theory and, corArtificial Intelligence, Translation, Limitations, Interpretation, Comparison, Unicityelatively, a reformulation of its technology concerning cultural texts.
- [785] arXiv:2404.04281 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Similar Data Points Identification with LLM: A Human-in-the-loop Strategy Using Summarization and Hidden State InsightsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This study introduces a simple yet effective method for identifying similar data points across non-free text domains, such as tabular and image data, using Large Language Models (LLMs). Our two-step approach involves data point summarization and hidden state extraction. Initially, data is condensed via summarization using an LLM, reducing complexity and highlighting essential information in sentences. Subsequently, the summarization sentences are fed through another LLM to extract hidden states, serving as compact, feature-rich representations. This approach leverages the advanced comprehension and generative capabilities of LLMs, offering a scalable and efficient strategy for similarity identification across diverse datasets. We demonstrate the effectiveness of our method in identifying similar data points on multiple datasets. Additionally, our approach enables non-technical domain experts, such as fraud investigators or marketing operators, to quickly identify similar data points tailored to specific scenarios, demonstrating its utility in practical applications. In general, our results open new avenues for leveraging LLMs in data analysis across various domains.
- [786] arXiv:2404.04285 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MIMIR: A Streamlined Platform for Personalized Agent Tuning in Domain ExpertiseChunyuan Deng , Xiangru Tang , Yilun Zhao , Hanming Wang , Haoran Wang , Wangchunshu Zhou , Arman Cohan , Mark GersteinSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recently, large language models (LLMs) have evolved into interactive agents, proficient in planning, tool use, and task execution across a wide variety of tasks. However, without specific agent tuning, open-source models like LLaMA currently struggle to match the efficiency of GPT- 4, particularly given the scarcity of agent-tuning datasets for fine-tuning. In response, we introduce \textsc{Mimir}: a streamlined platform offering a customizable pipeline that enables users to leverage both private knowledge and publicly available, legally compliant datasets at scale for \textbf{personalized agent tuning}. Additionally, \textsc{Mimir} supports the generation of general instruction-tuning datasets from the same input. This dual capability ensures that language agents developed through the platform possess both specific agent abilities and general competencies. \textsc{Mimir} integrates these features into a cohesive end-to-end platform, facilitating everything from the uploading of personalized files to one-click agent fine-tuning.
- [787] arXiv:2404.04286 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Language Model Evolution: An Iterated Learning PerspectiveSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: With the widespread adoption of Large Language Models (LLMs), the prevalence of iterative interactions among these models is anticipated to increase. Notably, recent advancements in multi-round self-improving methods allow LLMs to generate new examples for training subsequent models. At the same time, multi-agent LLM systems, involving automated interactions among agents, are also increasing in prominence. Thus, in both short and long terms, LLMs may actively engage in an evolutionary process. We draw parallels between the behavior of LLMs and the evolution of human culture, as the latter has been extensively studied by cognitive scientists for decades. Our approach involves leveraging Iterated Learning (IL), a Bayesian framework that elucidates how subtle biases are magnified during human cultural evolution, to explain some behaviors of LLMs. This paper outlines key characteristics of agents' behavior in the Bayesian-IL framework, including predictions that are supported by experimental verification with various LLMs. This theoretical framework could help to more effectively predict and guide the evolution of LLMs in desired directions.
- [788] arXiv:2404.04287 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CONFLARE: CONFormal LArge language model REtrievalComments: Github code: this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Retrieval-augmented generation (RAG) frameworks enable large language models (LLMs) to retrieve relevant information from a knowledge base and incorporate it into the context for generating responses. This mitigates hallucinations and allows for the updating of knowledge without retraining the LLM. However, RAG does not guarantee valid responses if retrieval fails to identify the necessary information as the context for response generation. Also, if there is contradictory content, the RAG response will likely reflect only one of the two possible responses. Therefore, quantifying uncertainty in the retrieval process is crucial for ensuring RAG trustworthiness. In this report, we introduce a four-step framework for applying conformal prediction to quantify retrieval uncertainty in RAG frameworks. First, a calibration set of questions answerable from the knowledge base is constructed. Each question's embedding is compared against document embeddings to identify the most relevant document chunks containing the answer and record their similarity scores. Given a user-specified error rate ({\alpha}), these similarity scores are then analyzed to determine a similarity score cutoff threshold. During inference, all chunks with similarity exceeding this threshold are retrieved to provide context to the LLM, ensuring the true answer is captured in the context with a (1-{\alpha}) confidence level. We provide a Python package that enables users to implement the entire workflow proposed in our work, only using LLMs and without human intervention.
- [789] arXiv:2404.04292 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Conversational Disease Diagnosis via External Planner-Controlled Large Language ModelsComments: Work in ProgressSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The development of large language models (LLMs) has brought unprecedented possibilities for artificial intelligence (AI) based medical diagnosis. However, the application perspective of LLMs in real diagnostic scenarios is still unclear because they are not adept at collecting patient data proactively. This study presents a LLM-based diagnostic system that enhances planning capabilities by emulating doctors. Our system involves two external planners to handle planning tasks. The first planner employs a reinforcement learning approach to formulate disease screening questions and conduct initial diagnoses. The second planner uses LLMs to parse medical guidelines and conduct differential diagnoses. By utilizing real patient electronic medical record data, we constructed simulated dialogues between virtual patients and doctors and evaluated the diagnostic abilities of our system. We demonstrate that our system significantly surpasses existing models, including GPT-4 Turbo, in both disease screening and differential diagnoses. This research represents a step towards more seamlessly integrating AI into clinical settings, potentially enhancing the accuracy and accessibility of medical diagnostics.
- [790] arXiv:2404.04293 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Reason from Fallacy: Enhancing Large Language Models' Logical Reasoning through Logical Fallacy UnderstandingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have demonstrated good performance in many reasoning tasks, but they still struggle with some complicated reasoning tasks including logical reasoning. One non-negligible reason for LLMs' suboptimal performance on logical reasoning is their overlooking of understanding logical fallacies correctly. To evaluate LLMs' capability of logical fallacy understanding (LFU), we propose five concrete tasks from three cognitive dimensions of WHAT, WHY, and HOW in this paper. Towards these LFU tasks, we have successfully constructed a new dataset LFUD based on GPT-4 accompanied by a little human effort. Our extensive experiments justify that our LFUD can be used not only to evaluate LLMs' LFU capability, but also to fine-tune LLMs to obtain significantly enhanced performance on logical reasoning.
- [791] arXiv:2404.04299 (cross-list from q-bio.QM) [ pdf , ps , html , other ]
-
Title: GENEVIC: GENetic data Exploration and Visualization via Intelligent interactive ConsoleAnindita Nath (1), Savannah Mwesigwa (1), Yulin Dai (1), Xiaoqian Jiang (2), Zhongming Zhao (1 and 3) ((1) Center for Precision Health, McWilliams School of Biomedical Informatics, UT Health Houston, TX (2) Department of Health Data Science and Artificial Intelligence, McWilliams School of Biomedical Informatics, UT Health Houston, TX, (3) MD Anderson Cancer Center, UTHealth Graduate School of Biomedical Sciences, Houston, TX)Subjects: Quantitative Methods (q-bio.QM) ; Artificial Intelligence (cs.AI)
Abstract: Summary: The vast generation of genetic data poses a significant challenge in efficiently uncovering valuable knowledge. Introducing GENEVIC, an AI-driven chat framework that tackles this challenge by bridging the gap between genetic data generation and biomedical knowledge discovery. Leveraging generative AI, notably ChatGPT, it serves as a biologist's 'copilot'. It automates the analysis, retrieval, and visualization of customized domain-specific genetic information, and integrates functionalities to generate protein interaction networks, enrich gene sets, and search scientific literature from PubMed, Google Scholar, and arXiv, making it a comprehensive tool for biomedical research. In its pilot phase, GENEVIC is assessed using a curated database that ranks genetic variants associated with Alzheimer's disease, schizophrenia, and cognition, based on their effect weights from the Polygenic Score Catalog, thus enabling researchers to prioritize genetic variants in complex diseases. GENEVIC's operation is user-friendly, accessible without any specialized training, secured by Azure OpenAI's HIPAA-compliant infrastructure, and evaluated for its efficacy through real-time query testing. As a prototype, GENEVIC is set to advance genetic research, enabling informed biomedical decisions.
Availability and implementation: GENEVIC is publicly accessible at https://genevic-anath2024.streamlit.app. The underlying code is open-source and available via GitHub at this https URL . - [792] arXiv:2404.04302 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CBR-RAG: Case-Based Reasoning for Retrieval Augmented Generation in LLMs for Legal Question AnsweringNirmalie Wiratunga , Ramitha Abeyratne , Lasal Jayawardena , Kyle Martin , Stewart Massie , Ikechukwu Nkisi-Orji , Ruvan Weerasinghe , Anne Liret , Bruno FleischComments: Submitted to ICCBR'24Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Model (LLM) output by providing prior knowledge as context to input. This is beneficial for knowledge-intensive and expert reliant tasks, including legal question-answering, which require evidence to validate generated text outputs. We highlight that Case-Based Reasoning (CBR) presents key opportunities to structure retrieval as part of the RAG process in an LLM. We introduce CBR-RAG, where CBR cycle's initial retrieval stage, its indexing vocabulary, and similarity knowledge containers are used to enhance LLM queries with contextually relevant cases. This integration augments the original LLM query, providing a richer prompt. We present an evaluation of CBR-RAG, and examine different representations (i.e. general and domain-specific embeddings) and methods of comparison (i.e. inter, intra and hybrid similarity) on the task of legal question-answering. Our results indicate that the context provided by CBR's case reuse enforces similarity between relevant components of the questions and the evidence base leading to significant improvements in the quality of generated answers.
- [793] arXiv:2404.04306 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: AuditGPT: Auditing Smart Contracts with ChatGPTSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Abstract: To govern smart contracts running on Ethereum, multiple Ethereum Request for Comment (ERC) standards have been developed, each containing a set of rules to guide the behaviors of smart contracts. Violating the ERC rules could cause serious security issues and financial loss, signifying the importance of verifying smart contracts follow ERCs. Today's practices of such verification are to either manually audit each single contract or use expert-developed, limited-scope program-analysis tools, both of which are far from being effective in identifying ERC rule violations. This paper presents a tool named AuditGPT that leverages large language models (LLMs) to automatically and comprehensively verify ERC rules against smart contracts. To build AuditGPT, we first conduct an empirical study on 222 ERC rules specified in four popular ERCs to understand their content, their security impacts, their specification in natural language, and their implementation in Solidity. Guided by the study, we construct AuditGPT by separating the large, complex auditing process into small, manageable tasks and design prompts specialized for each ERC rule type to enhance LLMs' auditing performance. In the evaluation, AuditGPT successfully pinpoints 418 ERC rule violations and only reports 18 false positives, showcasing its effectiveness and accuracy. Moreover, AuditGPT beats an auditing service provided by security experts in effectiveness, accuracy, and cost, demonstrating its advancement over state-of-the-art smart-contract auditing practices.
- [794] arXiv:2404.04310 (cross-list from nlin.PS) [ pdf , ps , html , other ]
-
Title: Suppressing Modulation Instability with Reinforcement LearningSubjects: Pattern Formation and Solitons (nlin.PS) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Applied Physics (physics.app-ph)
Abstract: Modulation instability is a phenomenon of spontaneous pattern formation in nonlinear media, oftentimes leading to an unpredictable behaviour and a degradation of a signal of interest. We propose an approach based on reinforcement learning to suppress the unstable modes by optimizing the parameters for the time modulation of the potential in the nonlinear system. We test our approach in 1D and 2D cases and propose a new class of physically-meaningful reward functions to guarantee tamed instability.
- [795] arXiv:2404.04311 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A Real-time Anomaly Detection Using Convolutional Autoencoder with Dynamic ThresholdSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The majority of modern consumer-level energy is generated by real-time smart metering systems. These frequently contain anomalies, which prevent reliable estimates of the series' evolution. This work introduces a hybrid modeling approach combining statistics and a Convolutional Autoencoder with a dynamic threshold. The threshold is determined based on Mahalanobis distance and moving averages. It has been tested using real-life energy consumption data collected from smart metering systems. The solution includes a real-time, meter-level anomaly detection system that connects to an advanced monitoring system. This makes a substantial contribution by detecting unusual data movements and delivering an early warning. Early detection and subsequent troubleshooting can financially benefit organizations and consumers and prevent disasters from occurring.
- [796] arXiv:2404.04312 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Half-Space Feature Learning in Neural NetworksSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: There currently exist two extreme viewpoints for neural network feature learning -- (i) Neural networks simply implement a kernel method (a la NTK) and hence no features are learned (ii) Neural networks can represent (and hence learn) intricate hierarchical features suitable for the data. We argue in this paper neither interpretation is likely to be correct based on a novel viewpoint. Neural networks can be viewed as a mixture of experts, where each expert corresponds to a (number of layers length) path through a sequence of hidden units. We use this alternate interpretation to motivate a model, called the Deep Linearly Gated Network (DLGN), which sits midway between deep linear networks and ReLU networks. Unlike deep linear networks, the DLGN is capable of learning non-linear features (which are then linearly combined), and unlike ReLU networks these features are ultimately simple -- each feature is effectively an indicator function for a region compactly described as an intersection of (number of layers) half-spaces in the input space. This viewpoint allows for a comprehensive global visualization of features, unlike the local visualizations for neurons based on saliency/activation/gradient maps. Feature learning in DLGNs is shown to happen and the mechanism with which this happens is through learning half-spaces in the input space that contain smooth regions of the target function. Due to the structure of DLGNs, the neurons in later layers are fundamentally the same as those in earlier layers -- they all represent a half-space -- however, the dynamics of gradient descent impart a distinct clustering to the later layer neurons. We hypothesize that ReLU networks also have similar feature learning behaviour.
- [797] arXiv:2404.04313 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: JobFormer: Skill-Aware Job Recommendation with Semantic-Enhanced TransformerSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Job recommendation aims to provide potential talents with suitable job descriptions (JDs) consistent with their career trajectory, which plays an essential role in proactive talent recruitment. In real-world management scenarios, the available JD-user records always consist of JDs, user profiles, and click data, in which the user profiles are typically summarized as the user's skill distribution for privacy reasons. Although existing sophisticated recommendation methods can be directly employed, effective recommendation still has challenges considering the information deficit of JD itself and the natural heterogeneous gap between JD and user profile. To address these challenges, we proposed a novel skill-aware recommendation model based on the designed semantic-enhanced transformer to parse JDs and complete personalized job recommendation. Specifically, we first model the relative items of each JD and then adopt an encoder with the local-global attention mechanism to better mine the intra-job and inter-job dependencies from JD tuples. Moreover, we adopt a two-stage learning strategy for skill-aware recommendation, in which we utilize the skill distribution to guide JD representation learning in the recall stage, and then combine the user profiles for final prediction in the ranking stage. Consequently, we can embed rich contextual semantic representations for learning JDs, while skill-aware recommendation provides effective JD-user joint representation for click-through rate (CTR) prediction. To validate the superior performance of our method for job recommendation, we present a thorough empirical analysis of large-scale real-world and public datasets to demonstrate its effectiveness and interpretability.
- [798] arXiv:2404.04316 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Parameter Efficient Quasi-Orthogonal Fine-Tuning via Givens RotationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: With the increasingly powerful performances and enormous scales of Pretrained Language Models (PLMs), promoting parameter efficiency in fine-tuning has become a crucial need for effective and efficient adaptation to various downstream tasks. One representative line of fine-tuning methods is Orthogonal Fine-tuning (OFT), which rigorously preserves the angular distances within the parameter space to preserve the pretrained knowledge. Despite the empirical effectiveness, OFT still suffers low parameter efficiency at $\mathcal{O}(d^2)$ and limited capability of downstream adaptation. Inspired by Givens rotation, in this paper, we proposed quasi-Givens Orthogonal Fine-Tuning (qGOFT) to address the problems. We first use $\mathcal{O}(d)$ Givens rotations to accomplish arbitrary orthogonal transformation in $SO(d)$ with provable equivalence, reducing parameter complexity from $\mathcal{O}(d^2)$ to $\mathcal{O}(d)$. Then we introduce flexible norm and relative angular adjustments under soft orthogonality regularization to enhance the adaptation capability of downstream semantic deviations. Extensive experiments on various tasks and PLMs validate the effectiveness of our methods.
- [799] arXiv:2404.04318 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Robust Depth Enhancement via Polarization Prompt Fusion TuningComments: CVPR 2024. Project page: this https URL . The first two authors contribute equallySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Existing depth sensors are imperfect and may provide inaccurate depth values in challenging scenarios, such as in the presence of transparent or reflective objects. In this work, we present a general framework that leverages polarization imaging to improve inaccurate depth measurements from various depth sensors. Previous polarization-based depth enhancement methods focus on utilizing pure physics-based formulas for a single sensor. In contrast, our method first adopts a learning-based strategy where a neural network is trained to estimate a dense and complete depth map from polarization data and a sensor depth map from different sensors. To further improve the performance, we propose a Polarization Prompt Fusion Tuning (PPFT) strategy to effectively utilize RGB-based models pre-trained on large-scale datasets, as the size of the polarization dataset is limited to train a strong model from scratch. We conducted extensive experiments on a public dataset, and the results demonstrate that the proposed method performs favorably compared to existing depth enhancement baselines. Code and demos are available at this https URL .
- [800] arXiv:2404.04332 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Scope Ambiguities in Large Language ModelsComments: To be published in Transactions of the Association for Computational LinguisticsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Sentences containing multiple semantic operators with overlapping scope often create ambiguities in interpretation, known as scope ambiguities. These ambiguities offer rich insights into the interaction between semantic structure and world knowledge in language processing. Despite this, there has been little research into how modern large language models treat them. In this paper, we investigate how different versions of certain autoregressive language models -- GPT-2, GPT-3/3.5, Llama 2 and GPT-4 -- treat scope ambiguous sentences, and compare this with human judgments. We introduce novel datasets that contain a joint total of almost 1,000 unique scope-ambiguous sentences, containing interactions between a range of semantic operators, and annotated for human judgments. Using these datasets, we find evidence that several models (i) are sensitive to the meaning ambiguity in these sentences, in a way that patterns well with human judgments, and (ii) can successfully identify human-preferred readings at a high level of accuracy (over 90% in some cases).
- [801] arXiv:2404.04351 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Assisting humans in complex comparisons: automated information comparison at scaleComments: 11 pages, 7 figures, 5 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Generative Large Language Models enable efficient analytics across knowledge domains, rivalling human experts in information comparisons. However, the applications of LLMs for information comparisons face scalability challenges due to the difficulties in maintaining information across large contexts and overcoming model token limitations. To address these challenges, we developed the novel Abstractive Summarization \& Criteria-driven Comparison Endpoint (ASC$^2$End) system to automate information comparison at scale. Our system employs Semantic Text Similarity comparisons for generating evidence-supported analyses. We utilize proven data-handling strategies such as abstractive summarization and retrieval augmented generation to overcome token limitations and retain relevant information during model inference. Prompts were designed using zero-shot strategies to contextualize information for improved model reasoning. We evaluated abstractive summarization using ROUGE scoring and assessed the generated comparison quality using survey responses. Models evaluated on the ASC$^2$End system show desirable results providing insights on the expected performance of the system. ASC$^2$End is a novel system and tool that enables accurate, automated information comparison at scale across knowledge domains, overcoming limitations in context length and retrieval.
- [802] arXiv:2404.04376 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ClickDiffusion: Harnessing LLMs for Interactive Precise Image EditingComments: arXiv admin note: substantial text overlap with arXiv:2402.07925Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recently, researchers have proposed powerful systems for generating and manipulating images using natural language instructions. However, it is difficult to precisely specify many common classes of image transformations with text alone. For example, a user may wish to change the location and breed of a particular dog in an image with several similar dogs. This task is quite difficult with natural language alone, and would require a user to write a laboriously complex prompt that both disambiguates the target dog and describes the destination. We propose ClickDiffusion, a system for precise image manipulation and generation that combines natural language instructions with visual feedback provided by the user through a direct manipulation interface. We demonstrate that by serializing both an image and a multi-modal instruction into a textual representation it is possible to leverage LLMs to perform precise transformations of the layout and appearance of an image. Code available at this https URL .
- [803] arXiv:2404.04392 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Increased LLM Vulnerabilities from Fine-tuning and QuantizationSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have become very popular and have found use cases in many domains, such as chatbots, auto-task completion agents, and much more. However, LLMs are vulnerable to different types of attacks, such as jailbreaking, prompt injection attacks, and privacy leakage attacks. Foundational LLMs undergo adversarial and alignment training to learn not to generate malicious and toxic content. For specialized use cases, these foundational LLMs are subjected to fine-tuning or quantization for better performance and efficiency. We examine the impact of downstream tasks such as fine-tuning and quantization on LLM vulnerability. We test foundation models like Mistral, Llama, MosaicML, and their fine-tuned versions. Our research shows that fine-tuning and quantization reduces jailbreak resistance significantly, leading to increased LLM vulnerabilities. Finally, we demonstrate the utility of external guardrails in reducing LLM vulnerabilities.
- [804] arXiv:2404.04399 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: Longitudinal Targeted Minimum Loss-based Estimation with Temporal-Difference Heterogeneous TransformerToru Shirakawa , Yi Li , Yulun Wu , Sky Qiu , Yuxuan Li , Mingduo Zhao , Hiroyasu Iso , Mark van der LaanSubjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
Abstract: We propose Deep Longitudinal Targeted Minimum Loss-based Estimation (Deep LTMLE), a novel approach to estimate the counterfactual mean of outcome under dynamic treatment policies in longitudinal problem settings. Our approach utilizes a transformer architecture with heterogeneous type embedding trained using temporal-difference learning. After obtaining an initial estimate using the transformer, following the targeted minimum loss-based likelihood estimation (TMLE) framework, we statistically corrected for the bias commonly associated with machine learning algorithms. Furthermore, our method also facilitates statistical inference by enabling the provision of 95% confidence intervals grounded in asymptotic statistical theory. Simulation results demonstrate our method's superior performance over existing approaches, particularly in complex, long time-horizon scenarios. It remains effective in small-sample, short-duration contexts, matching the performance of asymptotically efficient estimators. To demonstrate our method in practice, we applied our method to estimate counterfactual mean outcomes for standard versus intensive blood pressure management strategies in a real-world cardiovascular epidemiology cohort study.
- [805] arXiv:2404.04403 (cross-list from stat.ME) [ pdf , ps , html , other ]
-
Title: Low-Rank Robust Subspace Tensor Clustering for Metro Passenger Flow ModelingComments: Conditionally Accepted in INFORMS Journal of Data ScienceSubjects: Methodology (stat.ME) ; Artificial Intelligence (cs.AI)
Abstract: Tensor clustering has become an important topic, specifically in spatio-temporal modeling, due to its ability to cluster spatial modes (e.g., stations or road segments) and temporal modes (e.g., time of the day or day of the week). Our motivating example is from subway passenger flow modeling, where similarities between stations are commonly found. However, the challenges lie in the innate high-dimensionality of tensors and also the potential existence of anomalies. This is because the three tasks, i.e., dimension reduction, clustering, and anomaly decomposition, are inter-correlated to each other, and treating them in a separate manner will render a suboptimal performance. Thus, in this work, we design a tensor-based subspace clustering and anomaly decomposition technique for simultaneously outlier-robust dimension reduction and clustering for high-dimensional tensors. To achieve this, a novel low-rank robust subspace clustering decomposition model is proposed by combining Tucker decomposition, sparse anomaly decomposition, and subspace clustering. An effective algorithm based on Block Coordinate Descent is proposed to update the parameters. Prudent experiments prove the effectiveness of the proposed framework via the simulation study, with a gain of +25% clustering accuracy than benchmark methods in a hard case. The interrelations of the three tasks are also analyzed via ablation studies, validating the interrelation assumption. Moreover, a case study in the station clustering based on real passenger flow data is conducted, with quite valuable insights discovered.
- [806] arXiv:2404.04420 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: The NES Video-Music Database: A Dataset of Symbolic Video Game Music Paired with Gameplay VideosComments: Accepted for publication at the 19th International Conference on the Foundations of Digital GamesSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Abstract: Neural models are one of the most popular approaches for music generation, yet there aren't standard large datasets tailored for learning music directly from game data. To address this research gap, we introduce a novel dataset named NES-VMDB, containing 98,940 gameplay videos from 389 NES games, each paired with its original soundtrack in symbolic format (MIDI). NES-VMDB is built upon the Nintendo Entertainment System Music Database (NES-MDB), encompassing 5,278 music pieces from 397 NES games. Our approach involves collecting long-play videos for 389 games of the original dataset, slicing them into 15-second-long clips, and extracting the audio from each clip. Subsequently, we apply an audio fingerprinting algorithm (similar to Shazam) to automatically identify the corresponding piece in the NES-MDB dataset. Additionally, we introduce a baseline method based on the Controllable Music Transformer to generate NES music conditioned on gameplay clips. We evaluated this approach with objective metrics, and the results showed that the conditional CMT improves musical structural quality when compared to its unconditional counterpart. Moreover, we used a neural classifier to predict the game genre of the generated pieces. Results showed that the CMT generator can learn correlations between gameplay videos and game genres, but further research has to be conducted to achieve human-level performance.
- [807] arXiv:2404.04446 (cross-list from stat.ME) [ pdf , ps , other ]
-
Title: Bounding Causal Effects with Leaky InstrumentsDavid S. Watson , Jordan Penn , Lee M. Gunderson , Gecia Bravo-Hermsdorff , Afsaneh Mastouri , Ricardo SilvaComments: Camera ready version (UAI 2024)Journal-ref: 40th Conference on Uncertainty in Artificial Intelligence (UAI 2024)Subjects: Methodology (stat.ME) ; Artificial Intelligence (cs.AI)
Abstract: Instrumental variables (IVs) are a popular and powerful tool for estimating causal effects in the presence of unobserved confounding. However, classical approaches rely on strong assumptions such as the $\textit{exclusion criterion}$, which states that instrumental effects must be entirely mediated by treatments. This assumption often fails in practice. When IV methods are improperly applied to data that do not meet the exclusion criterion, estimated causal effects may be badly biased. In this work, we propose a novel solution that provides $\textit{partial}$ identification in linear systems given a set of $\textit{leaky instruments}$, which are allowed to violate the exclusion criterion to some limited degree. We derive a convex optimization objective that provides provably sharp bounds on the average treatment effect under some common forms of information leakage, and implement inference procedures to quantify the uncertainty of resulting estimates. We demonstrate our method in a set of experiments with simulated data, where it performs favorably against the state of the art. An accompanying $\texttt{R}$ package, $\texttt{leakyIV}$, is available from $\texttt{CRAN}$.
- [808] arXiv:2404.04452 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Vision Transformers in Domain Adaptation and Generalization: A Study of RobustnessComments: 28 pages, 5 figures, Preprint submitted to ElsevierSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Deep learning models are often evaluated in scenarios where the data distribution is different from those used in the training and validation phases. The discrepancy presents a challenge for accurately predicting the performance of models once deployed on the target distribution. Domain adaptation and generalization are widely recognized as effective strategies for addressing such shifts, thereby ensuring reliable performance. The recent promising results in applying vision transformers in computer vision tasks, coupled with advancements in self-attention mechanisms, have demonstrated their significant potential for robustness and generalization in handling distribution shifts. Motivated by the increased interest from the research community, our paper investigates the deployment of vision transformers in domain adaptation and domain generalization scenarios. For domain adaptation methods, we categorize research into feature-level, instance-level, model-level adaptations, and hybrid approaches, along with other categorizations with respect to diverse strategies for enhancing domain adaptation. Similarly, for domain generalization, we categorize research into multi-domain learning, meta-learning, regularization techniques, and data augmentation strategies. We further classify diverse strategies in research, underscoring the various approaches researchers have taken to address distribution shifts by integrating vision transformers. The inclusion of comprehensive tables summarizing these categories is a distinct feature of our work, offering valuable insights for researchers. These findings highlight the versatility of vision transformers in managing distribution shifts, crucial for real-world applications, especially in critical safety and decision-making scenarios.
- [809] arXiv:2404.04456 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Beyond the Known: Adversarial Autoencoders in Novelty DetectionComments: Accepted at the VISAAP 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In novelty detection, the goal is to decide if a new data point should be categorized as an inlier or an outlier, given a training dataset that primarily captures the inlier distribution. Recent approaches typically use deep encoder and decoder network frameworks to derive a reconstruction error, and employ this error either to determine a novelty score, or as the basis for a one-class classifier. In this research, we use a similar framework but with a lightweight deep network, and we adopt a probabilistic score with reconstruction error. Our methodology calculates the probability of whether the sample comes from the inlier distribution or not. This work makes two key contributions. The first is that we compute the novelty probability by linearizing the manifold that holds the structure of the inlier distribution. This allows us to interpret how the probability is distributed and can be determined in relation to the local coordinates of the manifold tangent space. The second contribution is that we improve the training protocol for the network. Our results indicate that our approach is effective at learning the target class, and it outperforms recent state-of-the-art methods on several benchmark datasets.
- [810] arXiv:2404.04475 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Length-Controlled AlpacaEval: A Simple Way to Debias Automatic EvaluatorsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Abstract: LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce complex biases that are hard to remove. Even simple, known confounders such as preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression analysis approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for chat LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?". To achieve this, we first fit a generalized linear model to predict the biased output of interest (auto-annotator preferences) based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths. Length-controlling not only improves the robustness of the metric to manipulations in model verbosity, we also find that it increases the Spearman correlation with LMSYS' Chatbot Arena from 0.94 to 0.98. We release the code and leaderboard at this https URL .
- [811] arXiv:2404.04481 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Joint Identifiability of Cross-Domain Recommendation via Hierarchical Subspace DisentanglementComments: accepted to SIGIR 2024 as a Full Research PaperSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Cross-Domain Recommendation (CDR) seeks to enable effective knowledge transfer across domains. Existing works rely on either representation alignment or transformation bridges, but they struggle on identifying domain-shared from domain-specific latent factors. Specifically, while CDR describes user representations as a joint distribution over two domains, these methods fail to account for its joint identifiability as they primarily fixate on the marginal distribution within a particular domain. Such a failure may overlook the conditionality between two domains and how it contributes to latent factor disentanglement, leading to negative transfer when domains are weakly correlated. In this study, we explore what should and should not be transferred in cross-domain user representations from a causality perspective. We propose a Hierarchical subspace disentanglement approach to explore the Joint IDentifiability of cross-domain joint distribution, termed HJID, to preserve domain-specific behaviors from domain-shared factors. HJID organizes user representations into layers: generic shallow subspaces and domain-oriented deep subspaces. We first encode the generic pattern in the shallow subspace by minimizing the Maximum Mean Discrepancy of initial layer activation. Then, to dissect how domain-oriented latent factors are encoded in deeper layers activation, we construct a cross-domain causality-based data generation graph, which identifies cross-domain consistent and domain-specific components, adhering to the Minimal Change principle. This allows HJID to maintain stability whilst discovering unique factors for different domains, all within a generative framework of invertible transformations that guarantee the joint identifiability. With experiments on real-world datasets, we show that HJID outperforms SOTA methods on a range of strongly and weakly correlated CDR tasks.
- [812] arXiv:2404.04492 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: Automated Lane Change Behavior Prediction and Environmental Perception Based on SLAM TechnologySubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: In addition to environmental perception sensors such as cameras, radars, etc. in the automatic driving system, the external environment of the vehicle is perceived, in fact, there is also a perception sensor that has been silently dedicated in the system, that is, the positioning module. This paper explores the application of SLAM (Simultaneous Localization and Mapping) technology in the context of automatic lane change behavior prediction and environment perception for autonomous vehicles. It discusses the limitations of traditional positioning methods, introduces SLAM technology, and compares LIDAR SLAM with visual SLAM. Real-world examples from companies like Tesla, Waymo, and Mobileye showcase the integration of AI-driven technologies, sensor fusion, and SLAM in autonomous driving systems. The paper then delves into the specifics of SLAM algorithms, sensor technologies, and the importance of automatic lane changes in driving safety and efficiency. It highlights Tesla's recent update to its Autopilot system, which incorporates automatic lane change functionality using SLAM technology. The paper concludes by emphasizing the crucial role of SLAM in enabling accurate environment perception, positioning, and decision-making for autonomous vehicles, ultimately enhancing safety and driving experience.
- [813] arXiv:2404.04500 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Trustless Audits without Revealing Data or ModelsSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: There is an increasing conflict between business incentives to hide models and data as trade secrets, and the societal need for algorithmic transparency. For example, a rightsholder wishing to know whether their copyrighted works have been used during training must convince the model provider to allow a third party to audit the model and data. Finding a mutually agreeable third party is difficult, and the associated costs often make this approach impractical.
In this work, we show that it is possible to simultaneously allow model providers to keep their model weights (but not architecture) and data secret while allowing other parties to trustlessly audit model and data properties. We do this by designing a protocol called ZkAudit in which model providers publish cryptographic commitments of datasets and model weights, alongside a zero-knowledge proof (ZKP) certifying that published commitments are derived from training the model. Model providers can then respond to audit requests by privately computing any function F of the dataset (or model) and releasing the output of F alongside another ZKP certifying the correct execution of F. To enable ZkAudit, we develop new methods of computing ZKPs for SGD on modern neural nets for simple recommender systems and image classification models capable of high accuracies on ImageNet. Empirically, we show it is possible to provide trustless audits of DNNs, including copyright, censorship, and counterfactual audits with little to no loss in accuracy. - [814] arXiv:2404.04510 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: IITK at SemEval-2024 Task 2: Exploring the Capabilities of LLMs for Safe Biomedical Natural Language Inference for Clinical TrialsComments: Accepted at SemEval 2024, NAACL 2024; 8 PagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large Language models (LLMs) have demonstrated state-of-the-art performance in various natural language processing (NLP) tasks across multiple domains, yet they are prone to shortcut learning and factual inconsistencies. This research investigates LLMs' robustness, consistency, and faithful reasoning when performing Natural Language Inference (NLI) on breast cancer Clinical Trial Reports (CTRs) in the context of SemEval 2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. We examine the reasoning capabilities of LLMs and their adeptness at logical problem-solving. A comparative analysis is conducted on pre-trained language models (PLMs), GPT-3.5, and Gemini Pro under zero-shot settings using Retrieval-Augmented Generation (RAG) framework, integrating various reasoning chains. The evaluation yields an F1 score of 0.69, consistency of 0.71, and a faithfulness score of 0.90 on the test dataset.
- [815] arXiv:2404.04511 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Cluster-based Video Summarization with Temporal Context AwarenessComments: 14 pages, 6 figures, accepted in PSIVT 2023Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we present TAC-SUM, a novel and efficient training-free approach for video summarization that addresses the limitations of existing cluster-based models by incorporating temporal context. Our method partitions the input video into temporally consecutive segments with clustering information, enabling the injection of temporal awareness into the clustering process, setting it apart from prior cluster-based summarization methods. The resulting temporal-aware clusters are then utilized to compute the final summary, using simple rules for keyframe selection and frame importance scoring. Experimental results on the SumMe dataset demonstrate the effectiveness of our proposed approach, outperforming existing unsupervised methods and achieving comparable performance to state-of-the-art supervised summarization techniques. Our source code is available for reference at \url{ this https URL }.
- [816] arXiv:2404.04513 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: IITK at SemEval-2024 Task 1: Contrastive Learning and Autoencoders for Semantic Textual Relatedness in Multilingual TextsComments: Accepted at SemEval 2024, NAACL 2024; 6 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper describes our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness. The challenge is focused on automatically detecting the degree of relatedness between pairs of sentences for 14 languages including both high and low-resource Asian and African languages. Our team participated in two subtasks consisting of Track A: supervised and Track B: unsupervised. This paper focuses on a BERT-based contrastive learning and similarity metric based approach primarily for the supervised track while exploring autoencoders for the unsupervised track. It also aims on the creation of a bigram relatedness corpus using negative sampling strategy, thereby producing refined word embeddings.
- [817] arXiv:2404.04517 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Latent-based Diffusion Model for Long-tailed RecognitionComments: 8 pages, 3 figures. Accepted by L3DIVU-CVPR2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Long-tailed imbalance distribution is a common issue in practical computer vision applications. Previous works proposed methods to address this problem, which can be categorized into several classes: re-sampling, re-weighting, transfer learning, and feature augmentation. In recent years, diffusion models have shown an impressive generation ability in many sub-problems of deep computer vision. However, its powerful generation has not been explored in long-tailed problems. We propose a new approach, the Latent-based Diffusion Model for Long-tailed Recognition (LDMLR), as a feature augmentation method to tackle the issue. First, we encode the imbalanced dataset into features using the baseline model. Then, we train a Denoising Diffusion Implicit Model (DDIM) using these encoded features to generate pseudo-features. Finally, we train the classifier using the encoded and pseudo-features from the previous two steps. The model's accuracy shows an improvement on the CIFAR-LT and ImageNet-LT datasets by using the proposed method.
- [818] arXiv:2404.04520 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: IITK at SemEval-2024 Task 4: Hierarchical Embeddings for Detection of Persuasion Techniques in MemesComments: Accepted at SemEval 2024, NAACL 2024; 9 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Memes are one of the most popular types of content used in an online disinformation campaign. They are primarily effective on social media platforms since they can easily reach many users. Memes in a disinformation campaign achieve their goal of influencing the users through several rhetorical and psychological techniques, such as causal oversimplification, name-calling, and smear. The SemEval 2024 Task 4 \textit{Multilingual Detection of Persuasion Technique in Memes} on identifying such techniques in the memes is divided across three sub-tasks: ($\mathbf{1}$) Hierarchical multi-label classification using only textual content of the meme, ($\mathbf{2}$) Hierarchical multi-label classification using both, textual and visual content of the meme and ($\mathbf{3}$) Binary classification of whether the meme contains a persuasion technique or not using it's textual and visual content. This paper proposes an ensemble of Class Definition Prediction (CDP) and hyperbolic embeddings-based approaches for this task. We enhance meme classification accuracy and comprehensiveness by integrating HypEmo's hierarchical label embeddings (Chen et al., 2023) and a multi-task learning framework for emotion prediction. We achieve a hierarchical F1-score of 0.60, 0.67, and 0.48 on the respective sub-tasks.
- [819] arXiv:2404.04522 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Q-PEFT: Query-dependent Parameter Efficient Fine-tuning for Text Reranking with Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: Parameter Efficient Fine-Tuning (PEFT) methods have been extensively utilized in Large Language Models (LLMs) to improve the down-streaming tasks without the cost of fine-tuing the whole LLMs. Recent studies have shown how to effectively use PEFT for fine-tuning LLMs in ranking tasks with convincing performance; there are some limitations, including the learned prompt being fixed for different documents, overfitting to specific tasks, and low adaptation ability. In this paper, we introduce a query-dependent parameter efficient fine-tuning (Q-PEFT) approach for text reranking to leak the information of the true queries to LLMs and then make the generation of true queries from input documents much easier. Specifically, we utilize the query to extract the top-$k$ tokens from concatenated documents, serving as contextual clues. We further augment Q-PEFT by substituting the retrieval mechanism with a multi-head attention layer to achieve end-to-end training and cover all the tokens in the documents, guiding the LLMs to generate more document-specific synthetic queries, thereby further improving the reranking performance. Extensive experiments are conducted on four public datasets, demonstrating the effectiveness of our proposed approach.
- [820] arXiv:2404.04525 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: IITK at SemEval-2024 Task 10: Who is the speaker? Improving Emotion Recognition and Flip Reasoning in Conversations via Speaker EmbeddingsComments: Accepted at SemEval 2024, NAACL 2024; 10 PagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper presents our approach for the SemEval-2024 Task 10: Emotion Discovery and Reasoning its Flip in Conversations. For the Emotion Recognition in Conversations (ERC) task, we utilize a masked-memory network along with speaker participation. We propose a transformer-based speaker-centric model for the Emotion Flip Reasoning (EFR) task. We also introduce Probable Trigger Zone, a region of the conversation that is more likely to contain the utterances causing the emotion to flip. For sub-task 3, the proposed approach achieves a 5.9 (F1 score) improvement over the task baseline. The ablation study results highlight the significance of various design choices in the proposed method.
- [821] arXiv:2404.04527 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: VTR: An Optimized Vision Transformer for SAR ATR Acceleration on FPGASachini Wickramasinghe , Dhruv Parikh , Bingyi Zhang , Rajgopal Kannan , Viktor Prasanna , Carl BusartComments: SPIE DCS 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) is a key technique used in military applications like remote-sensing image recognition. Vision Transformers (ViTs) are the current state-of-the-art in various computer vision applications, outperforming their CNN counterparts. However, using ViTs for SAR ATR applications is challenging due to (1) standard ViTs require extensive training data to generalize well due to their low locality; the standard SAR datasets, however, have a limited number of labeled training data which reduces the learning capability of ViTs; (2) ViTs have a high parameter count and are computation intensive which makes their deployment on resource-constrained SAR platforms difficult. In this work, we develop a lightweight ViT model that can be trained directly on small datasets without any pre-training by utilizing the Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) modules. We directly train this model on SAR datasets which have limited training samples to evaluate its effectiveness for SAR ATR applications. We evaluate our proposed model, that we call VTR (ViT for SAR ATR), on three widely used SAR datasets: MSTAR, SynthWakeSAR, and GBSAR. Further, we propose a novel FPGA accelerator for VTR, in order to enable deployment for real-time SAR ATR applications.
- [822] arXiv:2404.04534 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Impact of Fairness Regulations on Institutions' Policies and Population QualificationsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: The proliferation of algorithmic systems has fueled discussions surrounding the regulation and control of their social impact. Herein, we consider a system whose primary objective is to maximize utility by selecting the most qualified individuals. To promote demographic parity in the selection algorithm, we consider penalizing discrimination across social groups. We examine conditions under which a discrimination penalty can effectively reduce disparity in the selection. Additionally, we explore the implications of such a penalty when individual qualifications may evolve over time in response to the imposed penalizing policy. We identify scenarios where the penalty could hinder the natural attainment of equity within the population. Moreover, we propose certain conditions that can counteract this undesirable outcome, thus ensuring fairness.
- [823] arXiv:2404.04544 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained DiffusionComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, text encoder capacity (limited tokens), and the inherent difficulty of generating complex scenes involving multiple humans. While current methods attempted to address training size limit only, they often yielded human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes prior limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene employs a staged and hierarchical approach to initially generate a detailed base image focusing on crucial elements in instance creation for multiple humans and detailed descriptions beyond token limit of diffusion model, and then to seamlessly convert the base image to a higher-resolution output, exceeding training image size and incorporating details aware of text and instances via our novel instance-aware hierarchical enlargement process that consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in terms of correspondence with detailed text descriptions and naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models without costly retraining. Project page: this https URL .
- [824] arXiv:2404.04547 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Exhaustive Exploitation of Nature-inspired Computation for Cancer Screening in an Ensemble MannerSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Accurate screening of cancer types is crucial for effective cancer detection and precise treatment selection. However, the association between gene expression profiles and tumors is often limited to a small number of biomarker genes. While computational methods using nature-inspired algorithms have shown promise in selecting predictive genes, existing techniques are limited by inefficient search and poor generalization across diverse datasets. This study presents a framework termed Evolutionary Optimized Diverse Ensemble Learning (EODE) to improve ensemble learning for cancer classification from gene expression data. The EODE methodology combines an intelligent grey wolf optimization algorithm for selective feature space reduction, guided random injection modeling for ensemble diversity enhancement, and subset model optimization for synergistic classifier combinations. Extensive experiments were conducted across 35 gene expression benchmark datasets encompassing varied cancer types. Results demonstrated that EODE obtained significantly improved screening accuracy over individual and conventionally aggregated models. The integrated optimization of advanced feature selection, directed specialized modeling, and cooperative classifier ensembles helps address key challenges in current nature-inspired approaches. This provides an effective framework for robust and generalized ensemble learning with gene expression biomarkers. Specifically, we have opened EODE source code on Github at this https URL .
- [825] arXiv:2404.04564 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Enhancing Video Summarization with Context AwarenessComments: 115 pages, 1 supplementary paper, undergraduate thesis report at US-VNUHCMSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Video summarization is a crucial research area that aims to efficiently browse and retrieve relevant information from the vast amount of video content available today. With the exponential growth of multimedia data, the ability to extract meaningful representations from videos has become essential. Video summarization techniques automatically generate concise summaries by selecting keyframes, shots, or segments that capture the video's essence. This process improves the efficiency and accuracy of various applications, including video surveillance, education, entertainment, and social media. Despite the importance of video summarization, there is a lack of diverse and representative datasets, hindering comprehensive evaluation and benchmarking of algorithms. Existing evaluation metrics also fail to fully capture the complexities of video summarization, limiting accurate algorithm assessment and hindering the field's progress. To overcome data scarcity challenges and improve evaluation, we propose an unsupervised approach that leverages video data structure and information for generating informative summaries. By moving away from fixed annotations, our framework can produce representative summaries effectively. Moreover, we introduce an innovative evaluation pipeline tailored specifically for video summarization. Human participants are involved in the evaluation, comparing our generated summaries to ground truth summaries and assessing their informativeness. This human-centric approach provides valuable insights into the effectiveness of our proposed techniques. Experimental results demonstrate that our training-free framework outperforms existing unsupervised approaches and achieves competitive results compared to state-of-the-art supervised methods.
- [826] arXiv:2404.04575 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DROComments: 41 pages, 10 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract: The temperature parameter plays a profound role during training and/or inference with large foundation models (LFMs) such as large language models (LLMs) and CLIP models. Particularly, it adjusts the logits in the softmax function in LLMs, which is crucial for next token generation, and it scales the similarities in the contrastive loss for training CLIP models. A significant question remains: Is it viable to learn a neural network to predict a personalized temperature of any input data for enhancing LFMs"? In this paper, we present a principled framework for learning a small yet generalizable temperature prediction network (TempNet) to improve LFMs. Our solution is composed of a novel learning framework with a robust loss underpinned by constrained distributionally robust optimization (DRO), and a properly designed TempNet with theoretical inspiration. TempNet can be trained together with a large foundation model from scratch or learned separately given a pretrained foundation model. It is not only useful for predicting personalized temperature to promote the training of LFMs but also generalizable and transferable to new tasks. Our experiments on LLMs and CLIP models demonstrate that TempNet greatly improves the performance of existing solutions or models, e.g. Table 1. The code to reproduce the experimental results in this paper can be found at this https URL .
- [827] arXiv:2404.04587 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Neuroevolving Electronic Dynamical NetworksComments: 8 pages, 3 figuresSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Abstract: Neuroevolution is a powerful method of applying an evolutionary algorithm to refine the performance of artificial neural networks through natural selection; however, the fitness evaluation of these networks can be time-consuming and computationally expensive, particularly for continuous time recurrent neural networks (CTRNNs) that necessitate the simulation of differential equations. To overcome this challenge, field programmable gate arrays (FPGAs) have emerged as an increasingly popular solution, due to their high performance and low power consumption. Further, their ability to undergo dynamic and partial reconfiguration enables the extremely rapid evaluation of the fitness of CTRNNs, effectively addressing the bottleneck associated with conventional methods of evolvable hardware. By incorporating fitness evaluation directly upon the programmable logic of the FPGA, hyper-parallel evaluation becomes feasible, dramatically reducing the time required for assessment. This inherent parallelism of FPGAs accelerates the entire neuroevolutionary process by several orders of magnitude, facilitating faster convergence to an optimal solution. The work presented in this study demonstrates the potential of utilizing dynamic and partial reconfiguration on capable FPGAs as a powerful platform for neuroevolving dynamic neural networks.
- [828] arXiv:2404.04608 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Panoptic Perception: A Novel Task and Fine-grained Dataset for Universal Remote Sensing Image InterpretationJournal-ref: IEEE Transactions on Geoscience and Remote Sensing, 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Current remote-sensing interpretation models often focus on a single task such as detection, segmentation, or caption. However, the task-specific designed models are unattainable to achieve the comprehensive multi-level interpretation of images. The field also lacks support for multi-task joint interpretation datasets. In this paper, we propose Panoptic Perception, a novel task and a new fine-grained dataset (FineGrip) to achieve a more thorough and universal interpretation for RSIs. The new task, 1) integrates pixel-level, instance-level, and image-level information for universal image perception, 2) captures image information from coarse to fine granularity, achieving deeper scene understanding and description, and 3) enables various independent tasks to complement and enhance each other through multi-task learning. By emphasizing multi-task interactions and the consistency of perception results, this task enables the simultaneous processing of fine-grained foreground instance segmentation, background semantic segmentation, and global fine-grained image captioning. Concretely, the FineGrip dataset includes 2,649 remote sensing images, 12,054 fine-grained instance segmentation masks belonging to 20 foreground things categories, 7,599 background semantic masks for 5 stuff classes and 13,245 captioning sentences. Furthermore, we propose a joint optimization-based panoptic perception model. Experimental results on FineGrip demonstrate the feasibility of the panoptic perception task and the beneficial effect of multi-task joint optimization on individual tasks. The dataset will be publicly available.
- [829] arXiv:2404.04615 (cross-list from physics.flu-dyn) [ pdf , ps , html , other ]
-
Title: PointSAGE: Mesh-independent superresolution approach to fluid flow predictionsRajat Sarkar , Krishna Sai Sudhir Aripirala , Vishal Sudam Jadhav , Sagar Srinivas Sakhinana , Venkataramana RunkanaComments: Accepted at Data-Centric Machine Learning Workshop (DMLR) at ICLR, 2024Subjects: Fluid Dynamics (physics.flu-dyn) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Computational Fluid Dynamics (CFD) serves as a powerful tool for simulating fluid flow across diverse industries. High-resolution CFD simulations offer valuable insights into fluid behavior and flow patterns, aiding in optimizing design features or enhancing system performance. However, as resolution increases, computational data requirements and time increase proportionately. This presents a persistent challenge in CFD. Recently, efforts have been directed towards accurately predicting fine-mesh simulations using coarse-mesh simulations, with geometry and boundary conditions as input. Drawing inspiration from models designed for super-resolution, deep learning techniques like UNets have been applied to address this challenge. However, these existing methods are limited to structured data and fail if the mesh is unstructured due to its inability to convolute. Additionally, incorporating geometry/mesh information in the training process introduces drawbacks such as increased data requirements, challenges in generalizing to unseen geometries for the same physical phenomena, and issues with robustness to mesh distortions. To address these concerns, we propose a novel framework, PointSAGE a mesh-independent network that leverages the unordered, mesh-less nature of Pointcloud to learn the complex fluid flow and directly predict fine simulations, completely neglecting mesh information. Utilizing an adaptable framework, the model accurately predicts the fine data across diverse point cloud sizes, regardless of the training dataset's dimension. We have evaluated the effectiveness of PointSAGE on diverse datasets in different scenarios, demonstrating notable results and a significant acceleration in computational time in generating fine simulations compared to standard CFD techniques.
- [830] arXiv:2404.04626 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Analyzing and Understanding the Limitations of DPO: A Theoretical PerspectiveComments: Draft versionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Direct Preference Optimization (DPO), which derives reward signals directly from pairwise preference data, has shown its effectiveness on aligning Large Language Models (LLMs) with human preferences. Despite its widespread use across various tasks, DPO has been criticized for its sensitivity to the SFT's effectiveness and its hindrance to the learning capacity towards human-preferred responses, leading to less satisfactory performance. To overcome those limitations, the theoretical understanding of DPO are indispensable but still lacking. To this end, we take a step towards theoretically analyzing and understanding the limitations of DPO. Specifically, we provide an analytical framework using the field theory to analyze the optimization process of DPO. By analyzing the gradient vector field of the DPO loss function, we find that the DPO loss function decreases the probability of producing human dispreferred data at a faster rate than it increases the probability of producing preferred data. This provides theoretical insights for understanding the limitations of DPO discovered in the related research experiments, thereby setting the foundation for its improvement.
- [831] arXiv:2404.04642 (cross-list from eess.IV) [ pdf , ps , other ]
-
Title: Power-Efficient Image Storage: Leveraging Super Resolution Generative Adversarial Network for Sustainable Compression and Reduced Carbon FootprintAshok Mondal (1), Satyam Singh (1) ((1) Vellore Institute of Technology, Chennai)Comments: 5 pages, 5 figuresSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In recent years, large-scale adoption of cloud storage solutions has revolutionized the way we think about digital data storage. However, the exponential increase in data volume, especially images, has raised environmental concerns regarding power and resource consumption, as well as the rising digital carbon footprint emissions. The aim of this research is to propose a methodology for cloud-based image storage by integrating image compression technology with SuperResolution Generative Adversarial Networks (SRGAN). Rather than storing images in their original format directly on the cloud, our approach involves initially reducing the image size through compression and downsizing techniques before storage. Upon request, these compressed images will be retrieved and processed by SRGAN to generate images. The efficacy of the proposed method is evaluated in terms of PSNR and SSIM metrics. Additionally, a mathematical analysis is given to calculate power consumption and carbon footprint assesment. The proposed data compression technique provides a significant solution to achieve a reasonable trade off between environmental sustainability and industrial efficiency.
- [832] arXiv:2404.04656 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Binary Classifier Optimization for Large Language Model AlignmentComments: 18 pages, 9 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Aligning Large Language Models (LLMs) to human preferences through preference optimization has been crucial but labor-intensive, necessitating for each prompt a comparison of both a chosen and a rejected text completion by evaluators. Recently, Kahneman-Tversky Optimization (KTO) has demonstrated that LLMs can be aligned using merely binary "thumbs-up" or "thumbs-down" signals on each prompt-completion pair. In this paper, we present theoretical foundations to explain the successful alignment achieved through these binary signals. Our analysis uncovers a new perspective: optimizing a binary classifier, whose logit is a reward, implicitly induces minimizing the Direct Preference Optimization (DPO) loss. In the process of this discovery, we identified two techniques for effective alignment: reward shift and underlying distribution matching. Consequently, we propose a new algorithm, \textit{Binary Classifier Optimization}, that integrates the techniques. We validate our methodology in two settings: first, on a paired preference dataset, where our method performs on par with DPO and KTO; and second, on binary signal datasets simulating real-world conditions with divergent underlying distributions between thumbs-up and thumbs-down data. Our model consistently demonstrates effective and robust alignment across two base LLMs and three different binary signal datasets, showcasing the strength of our approach to learning from binary feedback.
- [833] arXiv:2404.04661 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Transform then Explore: a Simple and Effective Technique for Exploratory Combinatorial Optimization with Reinforcement LearningTianle Pu , Changjun Fan , Mutian Shen , Yizhou Lu , Li Zeng , Zohar Nussinov , Chao Chen , Zhong LiuSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Many complex problems encountered in both production and daily life can be conceptualized as combinatorial optimization problems (COPs) over graphs. Recent years, reinforcement learning (RL) based models have emerged as a promising direction, which treat the COPs solving as a heuristic learning problem. However, current finite-horizon-MDP based RL models have inherent limitations. They are not allowed to explore adquately for improving solutions at test time, which may be necessary given the complexity of NP-hard optimization tasks. Some recent attempts solve this issue by focusing on reward design and state feature engineering, which are tedious and ad-hoc. In this work, we instead propose a much simpler but more effective technique, named gauge transformation (GT). The technique is originated from physics, but is very effective in enabling RL agents to explore to continuously improve the solutions during test. Morever, GT is very simple, which can be implemented with less than 10 lines of Python codes, and can be applied to a vast majority of RL models. Experimentally, we show that traditional RL models with GT technique produce the state-of-the-art performances on the MaxCut problem. Furthermore, since GT is independent of any RL models, it can be seamlessly integrated into various RL frameworks, paving the way of these models for more effective explorations in the solving of general COPs.
- [834] arXiv:2404.04663 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Focused Active Learning for Histopathological Image ClassificationArne Schmidt , Pablo Morales-Álvarez , Lee A.D. Cooper , Lee A. Newberg , Andinet Enquobahrie , Aggelos K. Katsaggelos , Rafael MolinaSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Active Learning (AL) has the potential to solve a major problem of digital pathology: the efficient acquisition of labeled data for machine learning algorithms. However, existing AL methods often struggle in realistic settings with artifacts, ambiguities, and class imbalances, as commonly seen in the medical field. The lack of precise uncertainty estimations leads to the acquisition of images with a low informative value. To address these challenges, we propose Focused Active Learning (FocAL), which combines a Bayesian Neural Network with Out-of-Distribution detection to estimate different uncertainties for the acquisition function. Specifically, the weighted epistemic uncertainty accounts for the class imbalance, aleatoric uncertainty for ambiguous images, and an OoD score for artifacts. We perform extensive experiments to validate our method on MNIST and the real-world Panda dataset for the classification of prostate cancer. The results confirm that other AL methods are 'distracted' by ambiguities and artifacts which harm the performance. FocAL effectively focuses on the most informative images, avoiding ambiguities and artifacts during acquisition. For both experiments, FocAL outperforms existing AL approaches, reaching a Cohen's kappa of 0.764 with only 0.69% of the labeled Panda data.
- [835] arXiv:2404.04665 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Adaptive Intra-Class Variation Contrastive Learning for Unsupervised Person Re-IdentificationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The memory dictionary-based contrastive learning method has achieved remarkable results in the field of unsupervised person Re-ID. However, The method of updating memory based on all samples does not fully utilize the hardest sample to improve the generalization ability of the model, and the method based on hardest sample mining will inevitably introduce false-positive samples that are incorrectly clustered in the early stages of the model. Clustering-based methods usually discard a significant number of outliers, leading to the loss of valuable information. In order to address the issues mentioned before, we propose an adaptive intra-class variation contrastive learning algorithm for unsupervised Re-ID, called AdaInCV. And the algorithm quantitatively evaluates the learning ability of the model for each class by considering the intra-class variations after clustering, which helps in selecting appropriate samples during the training process of the model. To be more specific, two new strategies are proposed: Adaptive Sample Mining (AdaSaM) and Adaptive Outlier Filter (AdaOF). The first one gradually creates more reliable clusters to dynamically refine the memory, while the second can identify and filter out valuable outliers as negative samples.
- [836] arXiv:2404.04682 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Compositional Conservatism: A Transductive Approach in Offline Reinforcement LearningComments: ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Offline reinforcement learning (RL) is a compelling framework for learning optimal policies from past experiences without additional interaction with the environment. Nevertheless, offline RL inevitably faces the problem of distributional shifts, where the states and actions encountered during policy execution may not be in the training dataset distribution. A common solution involves incorporating conservatism into the policy or the value function to safeguard against uncertainties and unknowns. In this work, we focus on achieving the same objectives of conservatism but from a different perspective. We propose COmpositional COnservatism with Anchor-seeking (COCOA) for offline RL, an approach that pursues conservatism in a compositional manner on top of the transductive reparameterization (Netanyahu et al., 2023), which decomposes the input variable (the state in our case) into an anchor and its difference from the original input. Our COCOA seeks both in-distribution anchors and differences by utilizing the learned reverse dynamics model, encouraging conservatism in the compositional input space for the policy or value function. Such compositional conservatism is independent of and agnostic to the prevalent behavioral conservatism in offline RL. We apply COCOA to four state-of-the-art offline RL algorithms and evaluate them on the D4RL benchmark, where COCOA generally improves the performance of each algorithm. The code is available at this https URL .
- [837] arXiv:2404.04686 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AITaminul Islam , Md. Alif Sheakh , Mst. Sazia Tahosin , Most. Hasna Hena , Shopnil Akash , Yousef A. Bin Jardan , Gezahign Fentahun Wondmie , Hiba-Allah Nafidi , Mohammed BourhiaComments: Accepted for the Scientific Reports (Nature) journal. 32 pages, 12 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Breast cancer has rapidly increased in prevalence in recent years, making it one of the leading causes of mortality worldwide. Among all cancers, it is by far the most common. Diagnosing this illness manually requires significant time and expertise. Since detecting breast cancer is a time-consuming process, preventing its further spread can be aided by creating machine-based forecasts. Machine learning and Explainable AI are crucial in classification as they not only provide accurate predictions but also offer insights into how the model arrives at its decisions, aiding in the understanding and trustworthiness of the classification results. In this study, we evaluate and compare the classification accuracy, precision, recall, and F-1 scores of five different machine learning methods using a primary dataset (500 patients from Dhaka Medical College Hospital). Five different supervised machine learning techniques, including decision tree, random forest, logistic regression, naive bayes, and XGBoost, have been used to achieve optimal results on our dataset. Additionally, this study applied SHAP analysis to the XGBoost model to interpret the model's predictions and understand the impact of each feature on the model's output. We compared the accuracy with which several algorithms classified the data, as well as contrasted with other literature in this field. After final evaluation, this study found that XGBoost achieved the best model accuracy, which is 97%.
- [838] arXiv:2404.04714 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Data Poisoning Attacks on Off-Policy Policy Evaluation MethodsComments: Accepted at UAI 2022Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: Off-policy Evaluation (OPE) methods are a crucial tool for evaluating policies in high-stakes domains such as healthcare, where exploration is often infeasible, unethical, or expensive. However, the extent to which such methods can be trusted under adversarial threats to data quality is largely unexplored. In this work, we make the first attempt at investigating the sensitivity of OPE methods to marginal adversarial perturbations to the data. We design a generic data poisoning attack framework leveraging influence functions from robust statistics to carefully construct perturbations that maximize error in the policy value estimates. We carry out extensive experimentation with multiple healthcare and control datasets. Our results demonstrate that many existing OPE methods are highly prone to generating value estimates with large errors when subject to data poisoning attacks, even for small adversarial perturbations. These findings question the reliability of policy values derived using OPE methods and motivate the need for developing OPE methods that are statistically robust to train-time data poisoning attacks.
- [839] arXiv:2404.04718 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Interpretable Multimodal Learning for Cardiovascular Hemodynamics AssessmentPrasun C Tripathi , Sina Tabakhi , Mohammod N I Suvon , Lawrence Schöb , Samer Alabed , Andrew J Swift , Shuo Zhou , Haiping LuSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Pulmonary Arterial Wedge Pressure (PAWP) is an essential cardiovascular hemodynamics marker to detect heart failure. In clinical practice, Right Heart Catheterization is considered a gold standard for assessing cardiac hemodynamics while non-invasive methods are often needed to screen high-risk patients from a large population. In this paper, we propose a multimodal learning pipeline to predict PAWP marker. We utilize complementary information from Cardiac Magnetic Resonance Imaging (CMR) scans (short-axis and four-chamber) and Electronic Health Records (EHRs). We extract spatio-temporal features from CMR scans using tensor-based learning. We propose a graph attention network to select important EHR features for prediction, where we model subjects as graph nodes and feature relationships as graph edges using the attention mechanism. We design four feature fusion strategies: early, intermediate, late, and hybrid fusion. With a linear classifier and linear fusion strategies, our pipeline is interpretable. We validate our pipeline on a large dataset of $2,641$ subjects from our ASPIRE registry. The comparative study against state-of-the-art methods confirms the superiority of our pipeline. The decision curve analysis further validates that our pipeline can be applied to screen a large population. The code is available at this https URL .
- [840] arXiv:2404.04736 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ProtoAL: Interpretable Deep Active Learning with prototypes for medical imagingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The adoption of Deep Learning algorithms in the medical imaging field is a prominent area of research, with high potential for advancing AI-based Computer-aided diagnosis (AI-CAD) solutions. However, current solutions face challenges due to a lack of interpretability features and high data demands, prompting recent efforts to address these issues. In this study, we propose the ProtoAL method, where we integrate an interpretable DL model into the Deep Active Learning (DAL) framework. This approach aims to address both challenges by focusing on the medical imaging context and utilizing an inherently interpretable model based on prototypes. We evaluated ProtoAL on the Messidor dataset, achieving an area under the precision-recall curve of 0.79 while utilizing only 76.54\% of the available labeled data. These capabilities can enhances the practical usability of a DL model in the medical field, providing a means of trust calibration in domain experts and a suitable solution for learning in the data scarcity context often found.
- [841] arXiv:2404.04744 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Adapting Multi-objectivized Software Configuration TuningComments: This paper has been accepted at ACM FSE'24Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: When tuning software configuration for better performance (e.g., latency or throughput), an important issue that many optimizers face is the presence of local optimum traps, compounded by a highly rugged configuration landscape and expensive measurements. To mitigate these issues, a recent effort has shifted to focus on the level of optimization model (called meta multi-objectivization or MMO) instead of designing better optimizers as in traditional methods. This is done by using an auxiliary performance objective, together with the target performance objective, to help the search jump out of local optima. While effective, MMO needs a fixed weight to balance the two objectives-a parameter that has been found to be crucial as there is a large deviation of the performance between the best and the other settings. However, given the variety of configurable software systems, the "sweet spot" of the weight can vary dramatically in different cases and it is not possible to find the right setting without time-consuming trial and error. In this paper, we seek to overcome this significant shortcoming of MMO by proposing a weight adaptation method, dubbed AdMMO. Our key idea is to adaptively adjust the weight at the right time during tuning, such that a good proportion of the nondominated configurations can be maintained. Moreover, we design a partial duplicate retention mechanism to handle the issue of too many duplicate configurations without losing the rich information provided by the "good" duplicates.
Experiments on several real-world systems, objectives, and budgets show that, for 71% of the cases, AdMMO is considerably superior to MMO and a wide range of state-of-the-art optimizers while achieving generally better efficiency with the best speedup between 2.2x and 20x. - [842] arXiv:2404.04763 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: GenEARL: A Training-Free Generative Framework for Multimodal Event Argument Role LabelingComments: 20 pages, 15 Figures, 13 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Multimodal event argument role labeling (EARL), a task that assigns a role for each event participant (object) in an image is a complex challenge. It requires reasoning over the entire image, the depicted event, and the interactions between various objects participating in the event. Existing models heavily rely on high-quality event-annotated training data to understand the event semantics and structures, and they fail to generalize to new event types and domains. In this paper, we propose GenEARL, a training-free generative framework that harness the power of the modern generative models to understand event task descriptions given image contexts to perform the EARL task. Specifically, GenEARL comprises two stages of generative prompting with a frozen vision-language model (VLM) and a frozen large language model (LLM). First, a generative VLM learns the semantics of the event argument roles and generates event-centric object descriptions based on the image. Subsequently, a LLM is prompted with the generated object descriptions with a predefined template for EARL (i.e., assign an object with an event argument role). We show that GenEARL outperforms the contrastive pretraining (CLIP) baseline by 9.4% and 14.2% accuracy for zero-shot EARL on the M2E2 and SwiG datasets, respectively. In addition, we outperform CLIP-Event by 22% precision on M2E2 dataset. The framework also allows flexible adaptation and generalization to unseen domains.
- [843] arXiv:2404.04814 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Inference-Time Rule Eraser: Distilling and Removing Bias Rules to Mitigate Bias in Deployed ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Machine learning models often make predictions based on biased features such as gender, race, and other social attributes, posing significant fairness risks, especially in societal applications, such as hiring, banking, and criminal justice. Traditional approaches to addressing this issue involve retraining or fine-tuning neural networks with fairness-aware optimization objectives. However, these methods can be impractical due to significant computational resources, complex industrial tests, and the associated CO2 footprint. Additionally, regular users aiming to use fair models often lack access to model parameters. In this paper, we introduce Inference-Time Rule Eraser (Eraser), a novel method focused on removing biased decision-making rules during inference to address fairness concerns without modifying model weights. We begin by establishing a theoretical foundation for modifying model outputs to eliminate biased rules through Bayesian analysis. Next, we present a specific implementation of Eraser that involves two stages: (1) querying the model to distill biased rules into a patched model, and (2) excluding these biased rules during inference. Extensive experiments validate the effectiveness of our approach, showcasing its superior performance in addressing fairness concerns in AI systems.
- [844] arXiv:2404.04821 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: A Data-to-Product Multimodal Conceptual Framework to Achieve Automated Software Evolution for Context-rich Intelligent ApplicationsSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: While AI is extensively transforming Software Engineering (SE) fields, SE is still in need of a framework to overall consider all phases to facilitate Automated Software Evolution (ASEv), particularly for intelligent applications that are context-rich, instead of conquering each division independently. Its complexity comes from the intricacy of the intelligent applications, the heterogeneity of the data sources, and the constant changes in the context. This study proposes a conceptual framework for achieving automated software evolution, emphasizing the importance of multimodality learning. A Selective Sequential Scope Model (3S) model is developed based on the conceptual framework, and it can be used to categorize existing and future research when it covers different SE phases and multimodal learning tasks. This research is a preliminary step toward the blueprint of a higher-level ASEv. The proposed conceptual framework can act as a practical guideline for practitioners to prepare themselves for diving into this area. Although the study is about intelligent applications, the framework and analysis methods may be adapted for other types of software as AI brings more intelligence into their life cycles.
- [845] arXiv:2404.04824 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Mixup Domain Adaptations for Dynamic Remaining Useful Life PredictionsComments: accepted for publication in Knowledge-based SystemsJournal-ref: Knowledge-based Systems, 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Remaining Useful Life (RUL) predictions play vital role for asset planning and maintenance leading to many benefits to industries such as reduced downtime, low maintenance costs, etc. Although various efforts have been devoted to study this topic, most existing works are restricted for i.i.d conditions assuming the same condition of the training phase and the deployment phase. This paper proposes a solution to this problem where a mix-up domain adaptation (MDAN) is put forward. MDAN encompasses a three-staged mechanism where the mix-up strategy is not only performed to regularize the source and target domains but also applied to establish an intermediate mix-up domain where the source and target domains are aligned. The self-supervised learning strategy is implemented to prevent the supervision collapse problem. Rigorous evaluations have been performed where MDAN is compared to recently published works for dynamic RUL predictions. MDAN outperforms its counterparts with substantial margins in 12 out of 12 cases. In addition, MDAN is evaluated with the bearing machine dataset where it beats prior art with significant gaps in 8 of 12 cases. Source codes of MDAN are made publicly available in \url{ this https URL }.
- [846] arXiv:2404.04825 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Gradient-based Design of Computational Granular CrystalsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
Abstract: There is growing interest in engineering unconventional computing devices that leverage the intrinsic dynamics of physical substrates to perform fast and energy-efficient computations. Granular metamaterials are one such substrate that has emerged as a promising platform for building wave-based information processing devices with the potential to integrate sensing, actuation, and computation. Their high-dimensional and nonlinear dynamics result in nontrivial and sometimes counter-intuitive wave responses that can be shaped by the material properties, geometry, and configuration of individual grains. Such highly tunable rich dynamics can be utilized for mechanical computing in special-purpose applications. However, there are currently no general frameworks for the inverse design of large-scale granular materials. Here, we build upon the similarity between the spatiotemporal dynamics of wave propagation in material and the computational dynamics of Recurrent Neural Networks to develop a gradient-based optimization framework for harmonically driven granular crystals. We showcase how our framework can be utilized to design basic logic gates where mechanical vibrations carry the information at predetermined frequencies. We compare our design methodology with classic gradient-free methods and find that our approach discovers higher-performing configurations with less computational effort. Our findings show that a gradient-based optimization method can greatly expand the design space of metamaterials and provide the opportunity to systematically traverse the parameter space to find materials with the desired functionalities.
- [847] arXiv:2404.04839 (cross-list from cs.SE) [ pdf , ps , other ]
-
Title: AI for DevSecOps: A Landscape and Future OpportunitiesSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: DevOps has emerged as one of the most rapidly evolving software development paradigms. With the growing concerns surrounding security in software systems, the DevSecOps paradigm has gained prominence, urging practitioners to incorporate security practices seamlessly into the DevOps workflow. However, integrating security into the DevOps workflow can impact agility and impede delivery speed. Recently, the advancement of artificial intelligence (AI) has revolutionized automation in various software domains, including software security. AI-driven security approaches, particularly those leveraging machine learning or deep learning, hold promise in automating security workflows. They reduce manual efforts, which can be integrated into DevOps to ensure uninterrupted delivery speed and align with the DevSecOps paradigm simultaneously. This paper seeks to contribute to the critical intersection of AI and DevSecOps by presenting a comprehensive landscape of AI-driven security techniques applicable to DevOps and identifying avenues for enhancing security, trust, and efficiency in software development processes. We analyzed 99 research papers spanning from 2017 to 2023. Specifically, we address two key research questions (RQs). In RQ1, we identified 12 security tasks associated with the DevOps process and reviewed existing AI-driven security approaches. In RQ2, we discovered 15 challenges encountered by existing AI-driven security approaches and derived future research opportunities. Drawing insights from our findings, we discussed the state-of-the-art AI-driven security approaches, highlighted challenges in existing research, and proposed avenues for future opportunities.
- [848] arXiv:2404.04845 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SLPL SHROOM at SemEval2024 Task 06: A comprehensive study on models ability to detect hallucinationPouya Fallah , Soroush Gooran , Mohammad Jafarinasab , Pouya Sadeghi , Reza Farnia , Amirreza Tarabkhah , Zainab Sadat Taghavi , Hossein SametiSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Language models, particularly generative models, are susceptible to hallucinations, generating outputs that contradict factual knowledge or the source text. This study explores methods for detecting hallucinations in three SemEval-2024 Task 6 tasks: Machine Translation, Definition Modeling, and Paraphrase Generation. We evaluate two methods: semantic similarity between the generated text and factual references, and an ensemble of language models that judge each other's outputs. Our results show that semantic similarity achieves moderate accuracy and correlation scores in trial data, while the ensemble method offers insights into the complexities of hallucination detection but falls short of expectations. This work highlights the challenges of hallucination detection and underscores the need for further research in this critical area.
- [849] arXiv:2404.04848 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Task-Aware Encoder Control for Deep Video CompressionXingtong Ge , Jixiang Luo , Xinjie Zhang , Tongda Xu , Guo Lu , Dailan He , Jing Geng , Yan Wang , Jun Zhang , Hongwei QinComments: Accepted by CVPR 2024Subjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Prior research on deep video compression (DVC) for machine tasks typically necessitates training a unique codec for each specific task, mandating a dedicated decoder per task. In contrast, traditional video codecs employ a flexible encoder controller, enabling the adaptation of a single codec to different tasks through mechanisms like mode prediction. Drawing inspiration from this, we introduce an innovative encoder controller for deep video compression for machines. This controller features a mode prediction and a Group of Pictures (GoP) selection module. Our approach centralizes control at the encoding stage, allowing for adaptable encoder adjustments across different tasks, such as detection and tracking, while maintaining compatibility with a standard pre-trained DVC decoder. Empirical evidence demonstrates that our method is applicable across multiple tasks with various existing pre-trained DVCs. Moreover, extensive experiments demonstrate that our method outperforms previous DVC by about 25% bitrate for different tasks, with only one pre-trained decoder.
- [850] arXiv:2404.04849 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain InjectionSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Jailbreak attacks on Language Model Models (LLMs) entail crafting prompts aimed at exploiting the models to generate malicious content. Existing jailbreak attacks can successfully deceive the LLMs, however they cannot deceive the human. This paper proposes a new type of jailbreak attacks which can deceive both the LLMs and human (i.e., security analyst). The key insight of our idea is borrowed from the social psychology - that is human are easily deceived if the lie is hidden in truth. Based on this insight, we proposed the logic-chain injection attacks to inject malicious intention into benign truth. Logic-chain injection attack firstly dissembles its malicious target into a chain of benign narrations, and then distribute narrations into a related benign article, with undoubted facts. In this way, newly generate prompt cannot only deceive the LLMs, but also deceive human.
- [851] arXiv:2404.04854 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Contextual Chart Generation for Cyber DeceptionComments: 13 pages including referencesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: Honeyfiles are security assets designed to attract and detect intruders on compromised systems. Honeyfiles are a type of honeypot that mimic real, sensitive documents, creating the illusion of the presence of valuable data. Interaction with a honeyfile reveals the presence of an intruder, and can provide insights into their goals and intentions. Their practical use, however, is limited by the time, cost and effort associated with manually creating realistic content. The introduction of large language models has made high-quality text generation accessible, but honeyfiles contain a variety of content including charts, tables and images. This content needs to be plausible and realistic, as well as semantically consistent both within honeyfiles and with the real documents they mimic, to successfully deceive an intruder.
In this paper, we focus on an important component of the honeyfile content generation problem: document charts. Charts are ubiquitous in corporate documents and are commonly used to communicate quantitative and scientific data. Existing image generation models, such as DALL-E, are rather prone to generating charts with incomprehensible text and unconvincing data. We take a multi-modal approach to this problem by combining two purpose-built generative models: a multitask Transformer and a specialized multi-head autoencoder. The Transformer generates realistic captions and plot text, while the autoencoder generates the underlying tabular data for the plot.
To advance the field of automated honeyplot generation, we also release a new document-chart dataset and propose a novel metric Keyword Semantic Matching (KSM). This metric measures the semantic consistency between keywords of a corpus and a smaller bag of words. Extensive experiments demonstrate excellent performance against multiple large language models, including ChatGPT and GPT4. - [852] arXiv:2404.04856 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Msmsfnet: a multi-stream and multi-scale fusion net for edge detectionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Edge detection is a long standing problem in computer vision. Recent deep learning based algorithms achieve state of-the-art performance in publicly available datasets. Despite the efficiency of these algorithms, their performance, however, relies heavily on the pretrained weights of the backbone network on the ImageNet dataset. This limits heavily the design space of deep learning based edge detectors. Whenever we want to devise a new model, we have to train this new model on the ImageNet dataset first, and then fine tune the model using the edge detection datasets. The comparison would be unfair otherwise. However, it is usually not feasible for many researchers to train a model on the ImageNet dataset due to the limited computation resources. In this work, we study the performance that can be achieved by state-of-the-art deep learning based edge detectors in publicly available datasets when they are trained from scratch, and devise a new network architecture, the multi-stream and multi scale fusion net (msmsfnet), for edge detection. We show in our experiments that by training all models from scratch to ensure the fairness of comparison, out model outperforms state-of-the art deep learning based edge detectors in three publicly available datasets.
- [853] arXiv:2404.04869 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Prompting Multi-Modal Tokens to Enhance End-to-End Autonomous Driving Imitation Learning with LLMsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: The utilization of Large Language Models (LLMs) within the realm of reinforcement learning, particularly as planners, has garnered a significant degree of attention in recent scholarly literature. However, a substantial proportion of existing research predominantly focuses on planning models for robotics that transmute the outputs derived from perception models into linguistic forms, thus adopting a `pure-language' strategy. In this research, we propose a hybrid End-to-End learning framework for autonomous driving by combining basic driving imitation learning with LLMs based on multi-modality prompt tokens. Instead of simply converting perception results from the separated train model into pure language input, our novelty lies in two aspects. 1) The end-to-end integration of visual and LiDAR sensory input into learnable multi-modality tokens, thereby intrinsically alleviating description bias by separated pre-trained perception models. 2) Instead of directly letting LLMs drive, this paper explores a hybrid setting of letting LLMs help the driving model correct mistakes and complicated scenarios. The results of our experiments suggest that the proposed methodology can attain driving scores of 49.21%, coupled with an impressive route completion rate of 91.34% in the offline evaluation conducted via CARLA. These performance metrics are comparable to the most advanced driving models.
- [854] arXiv:2404.04871 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Data Stream Sampling with Fuzzy Task Boundaries and Noisy LabelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: In the realm of continual learning, the presence of noisy labels within data streams represents a notable obstacle to model reliability and fairness. We focus on the data stream scenario outlined in pertinent literature, characterized by fuzzy task boundaries and noisy labels. To address this challenge, we introduce a novel and intuitive sampling method called Noisy Test Debiasing (NTD) to mitigate noisy labels in evolving data streams and establish a fair and robust continual learning algorithm. NTD is straightforward to implement, making it feasible across various scenarios. Our experiments benchmark four datasets, including two synthetic noise datasets (CIFAR10 and CIFAR100) and real-world noise datasets (mini-WebVision and Food-101N). The results validate the efficacy of NTD for online continual learning in scenarios with noisy labels in data streams. Compared to the previous leading approach, NTD achieves a training speedup enhancement over two times while maintaining or surpassing accuracy levels. Moreover, NTD utilizes less than one-fifth of the GPU memory resources compared to previous leading methods.
- [855] arXiv:2404.04874 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Graph Neural Networks for Binary ProgrammingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Abstract: This paper investigates a link between Graph Neural Networks (GNNs) and Binary Programming (BP) problems, laying the groundwork for GNNs to approximate solutions for these computationally challenging problems. By analyzing the sensitivity of BP problems, we are able to frame the solution of BP problems as a heterophilic node classification task. We then propose Binary-Programming GNN (BPGNN), an architecture that integrates graph representation learning techniques with BP-aware features to approximate BP solutions efficiently. Additionally, we introduce a self-supervised data generation mechanism, to enable efficient and tractable training data acquisition even for large-scale BP problems. Experimental evaluations of BPGNN across diverse BP problem sizes showcase its superior performance compared to exhaustive search and heuristic approaches. Finally, we discuss open challenges in the under-explored field of BP problems with GNNs.
- [856] arXiv:2404.04886 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: PagPassGPT: Pattern Guided Password Guessing via Generative Pretrained TransformerSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Amidst the surge in deep learning-based password guessing models, challenges of generating high-quality passwords and reducing duplicate passwords persist. To address these challenges, we present PagPassGPT, a password guessing model constructed on Generative Pretrained Transformer (GPT). It can perform pattern guided guessing by incorporating pattern structure information as background knowledge, resulting in a significant increase in the hit rate. Furthermore, we propose D&C-GEN to reduce the repeat rate of generated passwords, which adopts the concept of a divide-and-conquer approach. The primary task of guessing passwords is recursively divided into non-overlapping subtasks. Each subtask inherits the knowledge from the parent task and predicts succeeding tokens. In comparison to the state-of-the-art model, our proposed scheme exhibits the capability to correctly guess 12% more passwords while producing 25% fewer duplicates.
- [857] arXiv:2404.04903 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Online Learning under Haphazard Input Conditions: A Comprehensive Review and AnalysisSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The domain of online learning has experienced multifaceted expansion owing to its prevalence in real-life applications. Nonetheless, this progression operates under the assumption that the input feature space of the streaming data remains constant. In this survey paper, we address the topic of online learning in the context of haphazard inputs, explicitly foregoing such an assumption. We discuss, classify, evaluate, and compare the methodologies that are adept at modeling haphazard inputs, additionally providing the corresponding code implementations and their carbon footprint. Moreover, we classify the datasets related to the field of haphazard inputs and introduce evaluation metrics specifically designed for datasets exhibiting imbalance. The code of each methodology can be found at this https URL
- [858] arXiv:2404.04904 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Cross-Domain Audio Deepfake Detection: Dataset and AnalysisSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy. Recent zero-shot text-to-speech (TTS) models pose higher risks as they can clone voices with a single utterance. However, the existing ADD datasets are outdated, leading to suboptimal generalization of detection models. In this paper, we construct a new cross-domain ADD dataset comprising over 300 hours of speech data that is generated by five advanced zero-shot TTS models. To simulate real-world scenarios, we employ diverse attack methods and audio prompts from different datasets. Experiments show that, through novel attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve equal error rates of 4.1\% and 6.5\% respectively. Additionally, we demonstrate our models' outstanding few-shot ADD ability by fine-tuning with just one minute of target-domain data. Nonetheless, neural codec compressors greatly affect the detection accuracy, necessitating further research.
- [859] arXiv:2404.04905 (cross-list from stat.ME) [ pdf , ps , html , other ]
-
Title: Review for Handling Missing Data with special missing mechanismSubjects: Methodology (stat.ME) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Missing data poses a significant challenge in data science, affecting decision-making processes and outcomes. Understanding what missing data is, how it occurs, and why it is crucial to handle it appropriately is paramount when working with real-world data, especially in tabular data, one of the most commonly used data types in the real world. Three missing mechanisms are defined in the literature: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR), each presenting unique challenges in imputation. Most existing work are focused on MCAR that is relatively easy to handle. The special missing mechanisms of MNAR and MAR are less explored and understood. This article reviews existing literature on handling missing values. It compares and contrasts existing methods in terms of their ability to handle different missing mechanisms and data types. It identifies research gap in the existing literature and lays out potential directions for future research in the field. The information in this review will help data analysts and researchers to adopt and promote good practices for handling missing data in real-world problems.
- [860] arXiv:2404.04922 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Efficient Learnable Collaborative Attention for Single Image Super-ResolutionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Non-Local Attention (NLA) is a powerful technique for capturing long-range feature correlations in deep single image super-resolution (SR). However, NLA suffers from high computational complexity and memory consumption, as it requires aggregating all non-local feature information for each query response and recalculating the similarity weight distribution for different abstraction levels of features. To address these challenges, we propose a novel Learnable Collaborative Attention (LCoA) that introduces inductive bias into non-local modeling. Our LCoA consists of two components: Learnable Sparse Pattern (LSP) and Collaborative Attention (CoA). LSP uses the k-means clustering algorithm to dynamically adjust the sparse attention pattern of deep features, which reduces the number of non-local modeling rounds compared with existing sparse solutions. CoA leverages the sparse attention pattern and weights learned by LSP, and co-optimizes the similarity matrix across different abstraction levels, which avoids redundant similarity matrix calculations. The experimental results show that our LCoA can reduce the non-local modeling time by about 83% in the inference stage. In addition, we integrate our LCoA into a deep Learnable Collaborative Attention Network (LCoAN), which achieves competitive performance in terms of inference time, memory consumption, and reconstruction quality compared with other state-of-the-art SR methods.
- [861] arXiv:2404.04924 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small DatasetsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Vision Transformers (ViTs) have achieved impressive results in large-scale image classification. However, when training from scratch on small datasets, there is still a significant performance gap between ViTs and Convolutional Neural Networks (CNNs), which is attributed to the lack of inductive bias. To address this issue, we propose a Graph-based Vision Transformer (GvT) that utilizes graph convolutional projection and graph-pooling. In each block, queries and keys are calculated through graph convolutional projection based on the spatial adjacency matrix, while dot-product attention is used in another graph convolution to generate values. When using more attention heads, the queries and keys become lower-dimensional, making their dot product an uninformative matching function. To overcome this low-rank bottleneck in attention heads, we employ talking-heads technology based on bilinear pooled features and sparse selection of attention tensors. This allows interaction among filtered attention scores and enables each attention mechanism to depend on all queries and keys. Additionally, we apply graph-pooling between two intermediate blocks to reduce the number of tokens and aggregate semantic information more effectively. Our experimental results show that GvT produces comparable or superior outcomes to deep convolutional networks and surpasses vision transformers without pre-training on large datasets. The code for our proposed model is publicly available on the website.
- [862] arXiv:2404.04932 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Understanding the Influence of Reward Margin on Preference Model PerformanceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a widely used framework for the training of language models. However, the process of using RLHF to develop a language model that is well-aligned presents challenges, especially when it comes to optimizing the reward model. Our research has found that existing reward models, when trained using the traditional ranking objective based on human preference data, often struggle to effectively distinguish between responses that are more or less favorable in real-world scenarios. To bridge this gap, our study introduces a novel method to estimate the preference differences without the need for detailed, exhaustive labels from human annotators. Our experimental results provide empirical evidence that incorporating margin values into the training process significantly improves the effectiveness of reward models. This comparative analysis not only demonstrates the superiority of our approach in terms of reward prediction accuracy but also highlights its effectiveness in practical applications.
- [863] arXiv:2404.04943 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Chiplet Placement Order Exploration Based on Learning to Rank with Graph RepresentationComments: 6 pages, 8 figures and 6 tables, accepted by the Conference ISEDASubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Abstract: Chiplet-based systems, integrating various silicon dies manufactured at different integrated circuit technology nodes on a carrier interposer, have garnered significant attention in recent years due to their cost-effectiveness and competitive performance. The widespread adoption of reinforcement learning as a sequential placement method has introduced a new challenge in determining the optimal placement order for each chiplet. The order in which chiplets are placed on the interposer influences the spatial resources available for earlier and later placed chiplets, making the placement results highly sensitive to the sequence of chiplet placement. To address these challenges, we propose a learning to rank approach with graph representation, building upon the reinforcement learning framework RLPlanner. This method aims to select the optimal chiplet placement order for each chiplet-based system. Experimental results demonstrate that compared to placement order obtained solely based on the descending order of the chiplet area and the number of interconnect wires between the chiplets, utilizing the placement order obtained from the learning to rank network leads to further improvements in system temperature and inter-chiplet wirelength. Specifically, applying the top-ranked placement order obtained from the learning to rank network results in a 10.05% reduction in total inter-chiplet wirelength and a 1.01% improvement in peak system temperature during the chiplet placement process.
- [864] arXiv:2404.04947 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: Gull: A Generative Multifunctional Audio CodecComments: Demo page: this https URLSubjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
Abstract: We introduce Gull, a generative multifunctional audio codec. Gull is a general purpose neural audio compression and decompression model which can be applied to a wide range of tasks and applications such as real-time communication, audio super-resolution, and codec language models. The key components of Gull include (1) universal-sample-rate modeling via subband modeling schemes motivated by recent progress in audio source separation, (2) gain-shape representations motivated by traditional audio codecs, (3) improved residual vector quantization modules for simpler training, (4) elastic decoder network that enables user-defined model size and complexity during inference time, (5) built-in ability for audio super-resolution without the increase of bitrate. We compare Gull with existing traditional and neural audio codecs and show that Gull is able to achieve on par or better performance across various sample rates, bitrates and model complexities in both subjective and objective evaluation metrics.
- [865] arXiv:2404.04963 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical TrialsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) are at the forefront of NLP achievements but fall short in dealing with shortcut learning, factual inconsistency, and vulnerability to adversarial inputs.These shortcomings are especially critical in medical contexts, where they can misrepresent actual model capabilities. Addressing this, we present SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for ClinicalTrials. Our contributions include the refined NLI4CT-P dataset (i.e., Natural Language Inference for Clinical Trials - Perturbed), designed to challenge LLMs with interventional and causal reasoning tasks, along with a comprehensive evaluation of methods and results for participant submissions. A total of 106 participants registered for the task contributing to over 1200 individual submissions and 25 system overview papers. This initiative aims to advance the robustness and applicability of NLI models in healthcare, ensuring safer and more dependable AI assistance in clinical decision-making. We anticipate that the dataset, models, and outcomes of this task can support future research in the field of biomedical NLI. The dataset, competition leaderboard, and website are publicly available.
- [866] arXiv:2404.04969 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Temporal Generalization Estimation in Evolving GraphsComments: Published as a conference paper at ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graph Neural Networks (GNNs) are widely deployed in vast fields, but they often struggle to maintain accurate representations as graphs evolve. We theoretically establish a lower bound, proving that under mild conditions, representation distortion inevitably occurs over time. To estimate the temporal distortion without human annotation after deployment, one naive approach is to pre-train a recurrent model (e.g., RNN) before deployment and use this model afterwards, but the estimation is far from satisfactory. In this paper, we analyze the representation distortion from an information theory perspective, and attribute it primarily to inaccurate feature extraction during evolution. Consequently, we introduce Smart, a straightforward and effective baseline enhanced by an adaptive feature extractor through self-supervised graph reconstruction. In synthetic random graphs, we further refine the former lower bound to show the inevitable distortion over time and empirically observe that Smart achieves good estimation performance. Moreover, we observe that Smart consistently shows outstanding generalization estimation on four real-world evolving graphs. The ablation studies underscore the necessity of graph reconstruction. For example, on OGB-arXiv dataset, the estimation metric MAPE deteriorates from 2.19% to 8.00% without reconstruction.
- [867] arXiv:2404.04974 (cross-list from econ.EM) [ pdf , ps , html , other ]
-
Title: Neural Network Modeling for Forecasting Tourism Demand in Stopi\'{c}a Cave: A Serbian Cave Tourism StudySubjects: Econometrics (econ.EM) ; Artificial Intelligence (cs.AI)
Abstract: For modeling the number of visits in Stopića cave (Serbia) we consider the classical Auto-regressive Integrated Moving Average (ARIMA) model, Machine Learning (ML) method Support Vector Regression (SVR), and hybrid NeuralPropeth method which combines classical and ML concepts. The most accurate predictions were obtained with NeuralPropeth which includes the seasonal component and growing trend of time-series. In addition, non-linearity is modeled by shallow Neural Network (NN), and Google Trend is incorporated as an exogenous variable. Modeling tourist demand represents great importance for management structures and decision-makers due to its applicability in establishing sustainable tourism utilization strategies in environmentally vulnerable destinations such as caves. The data provided insights into the tourist demand in Stopića cave and preliminary data for addressing the issues of carrying capacity within the most visited cave in Serbia.
- [868] arXiv:2404.04983 (cross-list from q-bio.TO) [ pdf , ps , other ]
-
Title: Primary liver cancer classification from routine tumour biopsy using weakly supervised deep learningAurélie Beaufrère , Nora Ouzir , Paul Emile Zafar , Astrid Laurent-Bellue , Miguel Albuquerque , Gwladys Lubuela , Jules Grégory , Catherine Guettier , Kévin Mondet , Jean-Christophe Pesquet , Valérie ParadisComments: this https URLJournal-ref: JHEP Reports, Volume 6, Issue 3, 2024Subjects: Tissues and Organs (q-bio.TO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The diagnosis of primary liver cancers (PLCs) can be challenging, especially on biopsies and for combined hepatocellular-cholangiocarcinoma (cHCC-CCA). We automatically classified PLCs on routine-stained biopsies using a weakly supervised learning method. Weak tumour/non-tumour annotations served as labels for training a Resnet18 neural network, and the network's last convolutional layer was used to extract new tumour tile features. Without knowledge of the precise labels of the malignancies, we then applied an unsupervised clustering algorithm. Our model identified specific features of hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (iCCA). Despite no specific features of cHCC-CCA being recognized, the identification of HCC and iCCA tiles within a slide could facilitate the diagnosis of primary liver cancers, particularly cHCC-CCA.
Method and results: 166 PLC biopsies were divided into training, internal and external validation sets: 90, 29 and 47 samples. Two liver pathologists reviewed each whole-slide hematein eosin saffron (HES)-stained image (WSI). After annotating the tumour/non-tumour areas, 256x256 pixel tiles were extracted from the WSIs and used to train a ResNet18. The network was used to extract new tile features. An unsupervised clustering algorithm was then applied to the new tile features. In a two-cluster model, Clusters 0 and 1 contained mainly HCC and iCCA histological features. The diagnostic agreement between the pathological diagnosis and the model predictions in the internal and external validation sets was 100% (11/11) and 96% (25/26) for HCC and 78% (7/9) and 87% (13/15) for iCCA, respectively. For cHCC-CCA, we observed a highly variable proportion of tiles from each cluster (Cluster 0: 5-97%; Cluster 1: 2-94%). - [869] arXiv:2404.04986 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Dynamic Distinction Learning: Adaptive Pseudo Anomalies for Video Anomaly DetectionComments: To be published in the CVPR2024 WorkshopSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We introduce Dynamic Distinction Learning (DDL) for Video Anomaly Detection, a novel video anomaly detection methodology that combines pseudo-anomalies, dynamic anomaly weighting, and a distinction loss function to improve detection accuracy. By training on pseudo-anomalies, our approach adapts to the variability of normal and anomalous behaviors without fixed anomaly thresholds. Our model showcases superior performance on the Ped2, Avenue and ShanghaiTech datasets, where individual models are tailored for each scene. These achievements highlight DDL's effectiveness in advancing anomaly detection, offering a scalable and adaptable solution for video surveillance challenges.
- [870] arXiv:2404.04997 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Adapting LLMs for Efficient Context Processing through Soft Prompt CompressionCangqing Wang , Yutian Yang , Ruisi Li , Dan Sun , Ruicong Cai , Yuzhu Zhang , Chengqian Fu , Lillian FloydComments: This paper has been accepted by the 2024 International Conference on Image Processing and Computer Applications (IPCA 2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The rapid advancement of Large Language Models (LLMs) has inaugurated a transformative epoch in natural language processing, fostering unprecedented proficiency in text generation, comprehension, and contextual scrutiny. Nevertheless, effectively handling extensive contexts, crucial for myriad applications, poses a formidable obstacle owing to the intrinsic constraints of the models' context window sizes and the computational burdens entailed by their operations. This investigation presents an innovative framework that strategically tailors LLMs for streamlined context processing by harnessing the synergies among natural language summarization, soft prompt compression, and augmented utility preservation mechanisms. Our methodology, dubbed SoftPromptComp, amalgamates natural language prompts extracted from summarization methodologies with dynamically generated soft prompts to forge a concise yet semantically robust depiction of protracted contexts. This depiction undergoes further refinement via a weighting mechanism optimizing information retention and utility for subsequent tasks. We substantiate that our framework markedly diminishes computational overhead and enhances LLMs' efficacy across various benchmarks, while upholding or even augmenting the caliber of the produced content. By amalgamating soft prompt compression with sophisticated summarization, SoftPromptComp confronts the dual challenges of managing lengthy contexts and ensuring model scalability. Our findings point towards a propitious trajectory for augmenting LLMs' applicability and efficiency, rendering them more versatile and pragmatic for real-world applications. This research enriches the ongoing discourse on optimizing language models, providing insights into the potency of soft prompts and summarization techniques as pivotal instruments for the forthcoming generation of NLP solutions.
- [871] arXiv:2404.04998 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Weakly Supervised Deep Hyperspherical Quantization for Image RetrievalComments: In proceedings of AAAI 2021. Code and data are availableSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Deep quantization methods have shown high efficiency on large-scale image retrieval. However, current models heavily rely on ground-truth information, hindering the application of quantization in label-hungry scenarios. A more realistic demand is to learn from inexhaustible uploaded images that are associated with informal tags provided by amateur users. Though such sketchy tags do not obviously reveal the labels, they actually contain useful semantic information for supervising deep quantization. To this end, we propose Weakly-Supervised Deep Hyperspherical Quantization (WSDHQ), which is the first work to learn deep quantization from weakly tagged images. Specifically, 1) we use word embeddings to represent the tags and enhance their semantic information based on a tag correlation graph. 2) To better preserve semantic information in quantization codes and reduce quantization error, we jointly learn semantics-preserving embeddings and supervised quantizer on hypersphere by employing a well-designed fusion layer and tailor-made loss functions. Extensive experiments show that WSDHQ can achieve state-of-art performance on weakly-supervised compact coding. Code is available at this https URL .
- [872] arXiv:2404.05003 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Camera-Based Remote Physiology Sensing for Hundreds of Subjects Across Skin TonesComments: 11 pages, 5 figures, CHI24 Workshop PhysioCHISubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Remote photoplethysmography (rPPG) emerges as a promising method for non-invasive, convenient measurement of vital signs, utilizing the widespread presence of cameras. Despite advancements, existing datasets fall short in terms of size and diversity, limiting comprehensive evaluation under diverse conditions. This paper presents an in-depth analysis of the VitalVideo dataset, the largest real-world rPPG dataset to date, encompassing 893 subjects and 6 Fitzpatrick skin tones. Our experimentation with six unsupervised methods and three supervised models demonstrates that datasets comprising a few hundred subjects(i.e., 300 for UBFC-rPPG, 500 for PURE, and 700 for MMPD-Simple) are sufficient for effective rPPG model training. Our findings highlight the importance of diversity and consistency in skin tones for precise performance evaluation across different datasets.
- [873] arXiv:2404.05055 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Percentile Criterion Optimization in Offline Reinforcement LearningComments: Accepted at Neurips 2023Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the \emph{percentile criterion}. The percentile criterion is approximately solved by constructing an \emph{ambiguity set} that contains the true model with high probability and optimizing the policy for the worst model in the set. Since the percentile criterion is non-convex, constructing ambiguity sets is often challenging. Existing work uses \emph{Bayesian credible regions} as ambiguity sets, but they are often unnecessarily large and result in learning overly conservative policies. To overcome these shortcomings, we propose a novel Value-at-Risk based dynamic programming algorithm to optimize the percentile criterion without explicitly constructing any ambiguity sets. Our theoretical and empirical results show that our algorithm implicitly constructs much smaller ambiguity sets and learns less conservative robust policies.
- [874] arXiv:2404.05061 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Automated Prediction of Breast Cancer Response to Neoadjuvant Chemotherapy from DWI DataComments: Accepted for presentation at the IEEE International Symposium on Biomedical Imaging (ISBI)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Effective surgical planning for breast cancer hinges on accurately predicting pathological complete response (pCR) to neoadjuvant chemotherapy (NAC). Diffusion-weighted MRI (DWI) and machine learning offer a non-invasive approach for early pCR assessment. However, most machine-learning models require manual tumor segmentation, a cumbersome and error-prone task. We propose a deep learning model employing "Size-Adaptive Lesion Weighting" for automatic DWI tumor segmentation to enhance pCR prediction accuracy. Despite histopathological changes during NAC complicating DWI image segmentation, our model demonstrates robust performance. Utilizing the BMMR2 challenge dataset, it matches human experts in pCR prediction pre-NAC with an area under the curve (AUC) of 0.76 vs. 0.796, and surpasses standard automated methods mid-NAC, with an AUC of 0.729 vs. 0.654 and 0.576. Our approach represents a significant advancement in automating breast cancer treatment planning, enabling more reliable pCR predictions without manual segmentation.
- [875] arXiv:2404.05086 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Note on LoRASubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: LoRA (Low-Rank Adaptation) has emerged as a preferred method for efficiently adapting Large Language Models (LLMs) with remarkable simplicity and efficacy. This note extends the original LoRA paper by offering new perspectives that were not initially discussed and presents a series of insights for deploying LoRA at scale. Without introducing new experiments, we aim to improve the understanding and application of LoRA.
- [876] arXiv:2404.05090 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model CollapseSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The phenomenon of model collapse, introduced in (Shumailov et al., 2023), refers to the deterioration in performance that occurs when new models are trained on synthetic data generated from previously trained models. This recursive training loop makes the tails of the original distribution disappear, thereby making future-generation models forget about the initial (real) distribution. With the aim of rigorously understanding model collapse in language models, we consider in this paper a statistical model that allows us to characterize the impact of various recursive training scenarios. Specifically, we demonstrate that model collapse cannot be avoided when training solely on synthetic data. However, when mixing both real and synthetic data, we provide an estimate of a maximal amount of synthetic data below which model collapse can eventually be avoided. Our theoretical conclusions are further supported by empirical validations.
- [877] arXiv:2404.05094 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Active Test-Time Adaptation: Theoretical Analyses and An AlgorithmSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Test-time adaptation (TTA) addresses distribution shifts for streaming test data in unsupervised settings. Currently, most TTA methods can only deal with minor shifts and rely heavily on heuristic and empirical studies.
To advance TTA under domain shifts, we propose the novel problem setting of active test-time adaptation (ATTA) that integrates active learning within the fully TTA setting.
We provide a learning theory analysis, demonstrating that incorporating limited labeled test instances enhances overall performances across test domains with a theoretical guarantee. We also present a sample entropy balancing for implementing ATTA while avoiding catastrophic forgetting (CF). We introduce a simple yet effective ATTA algorithm, known as SimATTA, using real-time sample selection techniques. Extensive experimental results confirm consistency with our theoretical analyses and show that the proposed ATTA method yields substantial performance improvements over TTA methods while maintaining efficiency and shares similar effectiveness to the more demanding active domain adaptation (ADA) methods. Our code is available at this https URL - [878] arXiv:2404.05101 (cross-list from q-fin.CP) [ pdf , ps , html , other ]
-
Title: StockGPT: A GenAI Model for Stock Prediction and TradingComments: 19 pages, 3 figures, 7 tablesSubjects: Computational Finance (q-fin.CP) ; Artificial Intelligence (cs.AI); Portfolio Management (q-fin.PM); Pricing of Securities (q-fin.PR); Statistical Finance (q-fin.ST)
Abstract: This paper introduces StockGPT, an autoregressive ``number'' model trained and tested on 70 million daily U.S. stock returns over nearly 100 years. Treating each return series as a sequence of tokens, StockGPT automatically learns the hidden patterns predictive of future returns via its attention mechanism. On a held-out test sample from 2001 to 2023, a daily rebalanced long-short portfolio formed from StockGPT predictions earns an annual return of 119% with a Sharpe ratio of 6.5. The StockGPT-based portfolio completely spans momentum and long-/short-term reversals, eliminating the need for manually crafted price-based strategies, and also encompasses most leading stock market factors. This highlights the immense promise of generative AI in surpassing human in making complex financial investment decisions.
- [879] arXiv:2404.05136 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Self-Supervised Multi-Object Tracking with Path ConsistencyComments: Accepted at CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we propose a novel concept of path consistency to learn robust object matching without using manual object identity supervision. Our key idea is that, to track a object through frames, we can obtain multiple different association results from a model by varying the frames it can observe, i.e., skipping frames in observation. As the differences in observations do not alter the identities of objects, the obtained association results should be consistent. Based on this rationale, we generate multiple observation paths, each specifying a different set of frames to be skipped, and formulate the Path Consistency Loss that enforces the association results are consistent across different observation paths. We use the proposed loss to train our object matching model with only self-supervision. By extensive experiments on three tracking datasets (MOT17, PersonPath22, KITTI), we demonstrate that our method outperforms existing unsupervised methods with consistent margins on various evaluation metrics, and even achieves performance close to supervised methods.
- [880] arXiv:2404.05143 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Plug and Play with Prompts: A Prompt Tuning Approach for Controlling Text GenerationComments: 9 pages, 3 figures, Presented at Deployable AI Workshop at AAAI-2024Journal-ref: Presented at Deployable AI Workshop at AAAI-2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Transformer-based Large Language Models (LLMs) have shown exceptional language generation capabilities in response to text-based prompts. However, controlling the direction of generation via textual prompts has been challenging, especially with smaller models. In this work, we explore the use of Prompt Tuning to achieve controlled language generation. Generated text is steered using prompt embeddings, which are trained using a small language model, used as a discriminator. Moreover, we demonstrate that these prompt embeddings can be trained with a very small dataset, with as low as a few hundred training examples. Our method thus offers a data and parameter efficient solution towards controlling language model outputs. We carry out extensive evaluation on four datasets: SST-5 and Yelp (sentiment analysis), GYAFC (formality) and JIGSAW (toxic language). Finally, we demonstrate the efficacy of our method towards mitigating harmful, toxic, and biased text generated by language models.
- [881] arXiv:2404.05180 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: GloSoFarID: Global multispectral dataset for Solar Farm IDentification in satellite imagerySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Solar Photovoltaic (PV) technology is increasingly recognized as a pivotal solution in the global pursuit of clean and renewable energy. This technology addresses the urgent need for sustainable energy alternatives by converting solar power into electricity without greenhouse gas emissions. It not only curtails global carbon emissions but also reduces reliance on finite, non-renewable energy sources. In this context, monitoring solar panel farms becomes essential for understanding and facilitating the worldwide shift toward clean energy. This study contributes to this effort by developing the first comprehensive global dataset of multispectral satellite imagery of solar panel farms. This dataset is intended to form the basis for training robust machine learning models, which can accurately map and analyze the expansion and distribution of solar panel farms globally. The insights gained from this endeavor will be instrumental in guiding informed decision-making for a sustainable energy future. this https URL
- [882] arXiv:2404.05182 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: DLoRA: Distributed Parameter-Efficient Fine-Tuning Solution for Large Language ModelSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: To enhance the performance of large language models (LLM) on downstream tasks, one solution is to fine-tune certain LLM parameters and make it better align with the characteristics of the training dataset. This process is commonly known as parameter-efficient fine-tuning (PEFT). Due to the scale of LLM, PEFT operations are usually executed in the public environment (e.g., cloud server). This necessitates the sharing of sensitive user data across public environments, thereby raising potential privacy concerns. To tackle these challenges, we propose a distributed PEFT framework called DLoRA. DLoRA enables scalable PEFT operations to be performed collaboratively between the cloud and user devices. Coupled with the proposed Kill and Revive algorithm, the evaluation results demonstrate that DLoRA can significantly reduce the computation and communication workload over the user devices while achieving superior accuracy and privacy protection.
- [883] arXiv:2404.05188 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Have You Merged My Model? On The Robustness of Large Language Model IP Protection Methods Against Model MergingTianshuo Cong , Delong Ran , Zesen Liu , Xinlei He , Jinyuan Liu , Yichen Gong , Qi Li , Anyu Wang , Xiaoyun WangComments: Technical ReportSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Model merging is a promising lightweight model empowerment technique that does not rely on expensive computing devices (e.g., GPUs) or require the collection of specific training data. Instead, it involves editing different upstream model parameters to absorb their downstream task capabilities. However, uncertified model merging can infringe upon the Intellectual Property (IP) rights of the original upstream models. In this paper, we conduct the first study on the robustness of IP protection methods in model merging scenarios. We investigate two state-of-the-art IP protection techniques: Quantization Watermarking and Instructional Fingerprint, along with various advanced model merging technologies, such as Task Arithmetic, TIES-MERGING, and so on. Experimental results indicate that current Large Language Model (LLM) watermarking techniques cannot survive in the merged models, whereas model fingerprinting techniques can. Our research aims to highlight that model merging should be an indispensable consideration in the robustness assessment of model IP protection techniques, thereby promoting the healthy development of the open-source LLM community.
- [884] arXiv:2404.05213 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Evaluation of an LLM in Identifying Logical Fallacies: A Call for Rigor When Adopting LLMs in HCI ResearchSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: There is increasing interest in the adoption of LLMs in HCI research. However, LLMs may often be regarded as a panacea because of their powerful capabilities with an accompanying oversight on whether they are suitable for their intended tasks. We contend that LLMs should be adopted in a critical manner following rigorous evaluation. Accordingly, we present the evaluation of an LLM in identifying logical fallacies that will form part of a digital misinformation intervention. By comparing to a labeled dataset, we found that GPT-4 achieves an accuracy of 0.79, and for our intended use case that excludes invalid or unidentified instances, an accuracy of 0.90. This gives us the confidence to proceed with the application of the LLM while keeping in mind the areas where it still falls short. The paper describes our evaluation approach, results and reflections on the use of the LLM for our intended task.
- [885] arXiv:2404.05218 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory ConditioningComments: 2024 CVPR HighlightSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Human pose forecasting garners attention for its diverse applications. However, challenges in modeling the multi-modal nature of human motion and intricate interactions among agents persist, particularly with longer timescales and more agents. In this paper, we propose an interaction-aware trajectory-conditioned long-term multi-agent human pose forecasting model, utilizing a coarse-to-fine prediction approach: multi-modal global trajectories are initially forecasted, followed by respective local pose forecasts conditioned on each mode. In doing so, our Trajectory2Pose model introduces a graph-based agent-wise interaction module for a reciprocal forecast of local motion-conditioned global trajectory and trajectory-conditioned local pose. Our model effectively handles the multi-modality of human motion and the complexity of long-term multi-agent interactions, improving performance in complex environments. Furthermore, we address the lack of long-term (6s+) multi-agent (5+) datasets by constructing a new dataset from real-world images and 2D annotations, enabling a comprehensive evaluation of our proposed model. State-of-the-art prediction performance on both complex and simpler datasets confirms the generalized effectiveness of our method. The code is available at this https URL .
- [886] arXiv:2404.05221 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language ModelsShibo Hao , Yi Gu , Haotian Luo , Tianyang Liu , Xiyan Shao , Xinyuan Wang , Shuhua Xie , Haodi Ma , Adithya Samavedhi , Qiyue Gao , Zhen Wang , Zhiting HuComments: Project website: this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Generating accurate step-by-step reasoning is essential for Large Language Models (LLMs) to address complex problems and enhance robustness and interpretability. Despite the flux of research on developing advanced reasoning approaches, systematically analyzing the diverse LLMs and reasoning strategies in generating reasoning chains remains a significant challenge. The difficulties stem from the lack of two key elements: (1) an automatic method for evaluating the generated reasoning chains on different tasks, and (2) a unified formalism and implementation of the diverse reasoning approaches for systematic comparison. This paper aims to close the gap: (1) We introduce AutoRace for fully automated reasoning chain evaluation. Existing metrics rely on expensive human annotations or pre-defined LLM prompts not adaptable to different tasks. In contrast, AutoRace automatically creates detailed evaluation criteria tailored for each task, and uses GPT-4 for accurate evaluation following the criteria. (2) We develop LLM Reasoners, a library for standardized modular implementation of existing and new reasoning algorithms, under a unified formulation of the search, reward, and world model components. With the new evaluation and library, (3) we conduct extensive study of different reasoning approaches (e.g., CoT, ToT, RAP). The analysis reveals interesting findings about different factors contributing to reasoning, including the reward-guidance, breadth-vs-depth in search, world model, and prompt formats, etc.
- [887] arXiv:2404.05243 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Product Description and QA Assisted Self-Supervised Opinion SummarizationTejpalsingh Siledar , Rupasai Rangaraju , Sankara Sri Raghava Ravindra Muddu , Suman Banerjee , Amey Patil , Sudhanshu Shekhar Singh , Muthusamy Chelliah , Nikesh Garera , Swaprava Nath , Pushpak BhattacharyyaSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In e-commerce, opinion summarization is the process of summarizing the consensus opinions found in product reviews. However, the potential of additional sources such as product description and question-answers (QA) has been considered less often. Moreover, the absence of any supervised training data makes this task challenging. To address this, we propose a novel synthetic dataset creation (SDC) strategy that leverages information from reviews as well as additional sources for selecting one of the reviews as a pseudo-summary to enable supervised training. Our Multi-Encoder Decoder framework for Opinion Summarization (MEDOS) employs a separate encoder for each source, enabling effective selection of information while generating the summary. For evaluation, due to the unavailability of test sets with additional sources, we extend the Amazon, Oposum+, and Flipkart test sets and leverage ChatGPT to annotate summaries. Experiments across nine test sets demonstrate that the combination of our SDC approach and MEDOS model achieves on average a 14.5% improvement in ROUGE-1 F1 over the SOTA. Moreover, comparative analysis underlines the significance of incorporating additional sources for generating more informative summaries. Human evaluations further indicate that MEDOS scores relatively higher in coherence and fluency with 0.41 and 0.5 (-1 to 1) respectively, compared to existing models. To the best of our knowledge, we are the first to generate opinion summaries leveraging additional sources in a self-supervised setting.
- [888] arXiv:2404.05256 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Text-to-Image Synthesis for Any Artistic Styles: Advancements in Personalized Artistic Image Generation via Subdivision and Dual BindingComments: 20 pages, 12 figuersSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in text-to-image models, such as Stable Diffusion, have demonstrated their ability to synthesize visual images through natural language prompts. One approach of personalizing text-to-image models, exemplified by DreamBooth, fine-tunes the pre-trained model by binding unique text identifiers with a few images of a specific subject. Although existing fine-tuning methods have demonstrated competence in rendering images according to the styles of famous painters, it is still challenging to learn to produce images encapsulating distinct art styles due to abstract and broad visual perceptions of stylistic attributes such as lines, shapes, textures, and colors. In this paper, we introduce a new method, Single-StyleForge, for personalization. It fine-tunes pre-trained text-to-image diffusion models to generate diverse images in specified styles from text prompts. By using around 15-20 images of the target style, the approach establishes a foundational binding of a unique token identifier with a broad range of the target style. It also utilizes auxiliary images to strengthen this binding, resulting in offering specific guidance on representing elements such as persons in a target style-consistent manner. In addition, we present ways to improve the quality of style and text-image alignment through a method called Multi-StyleForge, which inherits the strategy used in StyleForge and learns tokens in multiple. Experimental evaluation conducted on six distinct artistic styles demonstrates substantial improvements in both the quality of generated images and the perceptual fidelity metrics, such as FID, KID, and CLIP scores.
- [889] arXiv:2404.05290 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MindSet: Vision. A toolbox for testing DNNs on key psychological experimentsValerio Biscione , Dong Yin , Gaurav Malhotra , Marin Dujmovic , Milton L. Montero , Guillermo Puebla , Federico Adolfi , Rachel F. Heaton , John E. Hummel , Benjamin D. Evans , Karim Habashy , Jeffrey S. BowersSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Multiple benchmarks have been developed to assess the alignment between deep neural networks (DNNs) and human vision. In almost all cases these benchmarks are observational in the sense they are composed of behavioural and brain responses to naturalistic images that have not been manipulated to test hypotheses regarding how DNNs or humans perceive and identify objects. Here we introduce the toolbox MindSet: Vision, consisting of a collection of image datasets and related scripts designed to test DNNs on 30 psychological findings. In all experimental conditions, the stimuli are systematically manipulated to test specific hypotheses regarding human visual perception and object recognition. In addition to providing pre-generated datasets of images, we provide code to regenerate these datasets, offering many configurable parameters which greatly extend the dataset versatility for different research contexts, and code to facilitate the testing of DNNs on these image datasets using three different methods (similarity judgments, out-of-distribution classification, and decoder method), accessible at this https URL . We test ResNet-152 on each of these methods as an example of how the toolbox can be used.
- [890] arXiv:2404.05324 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Back to the Future: GNN-based NO$_2$ Forecasting via Future CovariatesAntonio Giganti , Sara Mandelli , Paolo Bestagini , Umberto Giuriato , Alessandro D'Ausilio , Marco Marcon , Stefano TubaroComments: 5 pages, 4 figures, 1 table, accepted at IEEE-IGARSS 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: Due to the latest environmental concerns in keeping at bay contaminants emissions in urban areas, air pollution forecasting has been rising the forefront of all researchers around the world. When predicting pollutant concentrations, it is common to include the effects of environmental factors that influence these concentrations within an extended period, like traffic, meteorological conditions and geographical information. Most of the existing approaches exploit this information as past covariates, i.e., past exogenous variables that affected the pollutant but were not affected by it. In this paper, we present a novel forecasting methodology to predict NO$_2$ concentration via both past and future covariates. Future covariates are represented by weather forecasts and future calendar events, which are already known at prediction time. In particular, we deal with air quality observations in a city-wide network of ground monitoring stations, modeling the data structure and estimating the predictions with a Spatiotemporal Graph Neural Network (STGNN). We propose a conditioning block that embeds past and future covariates into the current observations. After extracting meaningful spatiotemporal representations, these are fused together and projected into the forecasting horizon to generate the final prediction. To the best of our knowledge, it is the first time that future covariates are included in time series predictions in a structured way. Remarkably, we find that conditioning on future weather information has a greater impact than considering past traffic conditions. We release our code implementation at this https URL .
- [891] arXiv:2404.05337 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Objectively Benchmarking Social Intelligence for Language Agents at Action LevelSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Prominent large language models have exhibited human-level performance in many domains, even enabling the derived agents to simulate human and social interactions. While practical works have substantiated the practicability of grounding language agents in sandbox simulation or embodied simulators, current social intelligence benchmarks either stay at the language level or use subjective metrics. In pursuit of a more realistic and objective evaluation, we introduce the Social Tasks in Sandbox Simulation (STSS) benchmark, which assesses language agents \textbf{objectively} at the \textbf{action level} by scrutinizing the goal achievements within the multi-agent simulation. Additionally, we sample conversation scenarios to build a language-level benchmark to provide an economically prudent preliminary evaluation and align with prevailing benchmarks. To gauge the significance of agent architecture, we implement a target-driven planning (TDP) module as an adjunct to the existing agent. Our evaluative findings highlight that the STSS benchmark is challenging for state-of-the-art language agents. Furthermore, it effectively discriminates between distinct language agents, suggesting its usefulness as a benchmark for evaluating both language models and agent architectures.
- [892] arXiv:2404.05348 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Iterative Refinement Strategy for Automated Data Labeling: Facial Landmark Diagnosis in Medical ImagingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Automated data labeling techniques are crucial for accelerating the development of deep learning models, particularly in complex medical imaging applications. However, ensuring accuracy and efficiency remains challenging. This paper presents iterative refinement strategies for automated data labeling in facial landmark diagnosis to enhance accuracy and efficiency for deep learning models in medical applications, including dermatology, plastic surgery, and ophthalmology. Leveraging feedback mechanisms and advanced algorithms, our approach iteratively refines initial labels, reducing reliance on manual intervention while improving label quality. Through empirical evaluation and case studies, we demonstrate the effectiveness of our proposed strategies in deep learning tasks across medical imaging domains. Our results highlight the importance of iterative refinement in automated data labeling to enhance the capabilities of deep learning systems in medical imaging applications.
- [893] arXiv:2404.05384 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Rethinking the Spatial Inconsistency in Classifier-Free Diffusion GuidanceComments: accepted by CVPR-2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Classifier-Free Guidance (CFG) has been widely used in text-to-image diffusion models, where the CFG scale is introduced to control the strength of text guidance on the whole image space. However, we argue that a global CFG scale results in spatial inconsistency on varying semantic strengths and suboptimal image quality. To address this problem, we present a novel approach, Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance degrees for different semantic units in text-to-image diffusion models. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. In particular, the cross-attention map in the denoising U-net backbone is renormalized for assigning each patch to the corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic units, we adaptively adjust the CFG scales across different semantic regions to rescale the text guidance degrees into a uniform level. Finally, extensive experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models, without requiring any extra training cost. our codes are available at this https URL .
- [894] arXiv:2404.05393 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: PAT: Pixel-wise Adaptive Training for Long-tailed SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Beyond class frequency, we recognize the impact of class-wise relationships among various class-specific predictions and the imbalance in label masks on long-tailed segmentation learning. To address these challenges, we propose an innovative Pixel-wise Adaptive Training (PAT) technique tailored for long-tailed segmentation. PAT has two key features: 1) class-wise gradient magnitude homogenization, and 2) pixel-wise class-specific loss adaptation (PCLA). First, the class-wise gradient magnitude homogenization helps alleviate the imbalance among label masks by ensuring equal consideration of the class-wise impact on model updates. Second, PCLA tackles the detrimental impact of both rare classes within the long-tailed distribution and inaccurate predictions from previous training stages by encouraging learning classes with low prediction confidence and guarding against forgetting classes with high confidence. This combined approach fosters robust learning while preventing the model from forgetting previously learned knowledge. PAT exhibits significant performance improvements, surpassing the current state-of-the-art by 2.2% in the NyU dataset. Moreover, it enhances overall pixel-wise accuracy by 2.85% and intersection over union value by 2.07%, with a particularly notable declination of 0.39% in detecting rare classes compared to Balance Logits Variation, as demonstrated on the three popular datasets, i.e., OxfordPetIII, CityScape, and NYU.
- [895] arXiv:2404.05399 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model SafetySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The last two years have seen a rapid growth in concerns around the safety of large language models (LLMs). Researchers and practitioners have met these concerns by introducing an abundance of new datasets for evaluating and improving LLM safety. However, much of this work has happened in parallel, and with very different goals in mind, ranging from the mitigation of near-term risks around bias and toxic content generation to the assessment of longer-term catastrophic risk potential. This makes it difficult for researchers and practitioners to find the most relevant datasets for a given use case, and to identify gaps in dataset coverage that future work may fill. To remedy these issues, we conduct a first systematic review of open datasets for evaluating and improving LLM safety. We review 102 datasets, which we identified through an iterative and community-driven process over the course of several months. We highlight patterns and trends, such as a a trend towards fully synthetic datasets, as well as gaps in dataset coverage, such as a clear lack of non-English datasets. We also examine how LLM safety datasets are used in practice -- in LLM release publications and popular LLM benchmarks -- finding that current evaluation practices are highly idiosyncratic and make use of only a small fraction of available datasets. Our contributions are based on this http URL , a living catalogue of open datasets for LLM safety, which we commit to updating continuously as the field of LLM safety develops.
- [896] arXiv:2404.05403 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: SoK: Gradient Leakage in Federated LearningSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Federated learning (FL) enables collaborative model training among multiple clients without raw data exposure. However, recent studies have shown that clients' private training data can be reconstructed from the gradients they share in FL, known as gradient inversion attacks (GIAs). While GIAs have demonstrated effectiveness under \emph{ideal settings and auxiliary assumptions}, their actual efficacy against \emph{practical FL systems} remains under-explored. To address this gap, we conduct a comprehensive study on GIAs in this work. We start with a survey of GIAs that establishes a milestone to trace their evolution and develops a systematization to uncover their inherent threats. Specifically, we categorize the auxiliary assumptions used by existing GIAs based on their practical accessibility to potential adversaries. To facilitate deeper analysis, we highlight the challenges that GIAs face in practical FL systems from three perspectives: \textit{local training}, \textit{model}, and \textit{post-processing}. We then perform extensive theoretical and empirical evaluations of state-of-the-art GIAs across diverse settings, utilizing eight datasets and thirteen models. Our findings indicate that GIAs have inherent limitations when reconstructing data under practical local training settings. Furthermore, their efficacy is sensitive to the trained model, and even simple post-processing measures applied to gradients can be effective defenses. Overall, our work provides crucial insights into the limited effectiveness of GIAs in practical FL systems. By rectifying prior misconceptions, we hope to inspire more accurate and realistic investigations on this topic.
- [897] arXiv:2404.05405 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Physics of Language Models: Part 3.3, Knowledge Capacity Scaling LawsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications. Consequently, a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined based on our estimation.
More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model's knowledge storage capacity. Notable insights include:
* The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. This arises because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train.
* Prepending training data with domain names (e.g., this http URL ) significantly increases a model's knowledge capacity. Language models can autonomously identify and prioritize domains rich in knowledge, optimizing their storage capacity. - [898] arXiv:2404.05406 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: PerkwE_COQA: Enhanced Persian Conversational Question Answering by combining contextual keyword extraction with Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Smart cities need the involvement of their residents to enhance quality of life. Conversational query-answering is an emerging approach for user engagement. There is an increasing demand of an advanced conversational question-answering that goes beyond classic systems. Existing approaches have shown that LLMs offer promising capabilities for CQA, but may struggle to capture the nuances of conversational contexts. The new approach involves understanding the content and engaging in a multi-step conversation with the user to fulfill their needs. This paper presents a novel method to elevate the performance of Persian Conversational question-answering (CQA) systems. It combines the strengths of Large Language Models (LLMs) with contextual keyword extraction. Our method extracts keywords specific to the conversational flow, providing the LLM with additional context to understand the user's intent and generate more relevant and coherent responses. We evaluated the effectiveness of this combined approach through various metrics, demonstrating significant improvements in CQA performance compared to an LLM-only baseline. The proposed method effectively handles implicit questions, delivers contextually relevant answers, and tackles complex questions that rely heavily on conversational context. The findings indicate that our method outperformed the evaluation benchmarks up to 8% higher than existing methods and the LLM-only baseline.
- [899] arXiv:2404.05415 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Relation Extraction Using Large Language Models: A Case Study on Acupuncture Point LocationsYiming Li , Xueqing Peng , Jianfu Li , Xu Zuo , Suyuan Peng , Donghong Pei , Cui Tao , Hua Xu , Na HongSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In acupuncture therapy, the accurate location of acupoints is essential for its effectiveness. The advanced language understanding capabilities of large language models (LLMs) like Generative Pre-trained Transformers (GPT) present a significant opportunity for extracting relations related to acupoint locations from textual knowledge sources. This study aims to compare the performance of GPT with traditional deep learning models (Long Short-Term Memory (LSTM) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT)) in extracting acupoint-related location relations and assess the impact of pretraining and fine-tuning on GPT's performance. We utilized the World Health Organization Standard Acupuncture Point Locations in the Western Pacific Region (WHO Standard) as our corpus, which consists of descriptions of 361 acupoints. Five types of relations ('direction_of,' 'distance_of,' 'part_of,' 'near_acupoint,' and 'located_near') (n= 3,174) between acupoints were annotated. Five models were compared: BioBERT, LSTM, pre-trained GPT-3.5, fine-tuned GPT-3.5, as well as pre-trained GPT-4. Performance metrics included micro-average exact match precision, recall, and F1 scores. Our results demonstrate that fine-tuned GPT-3.5 consistently outperformed other models in F1 scores across all relation types. Overall, it achieved the highest micro-average F1 score of 0.92. This study underscores the effectiveness of LLMs like GPT in extracting relations related to acupoint locations, with implications for accurately modeling acupuncture knowledge and promoting standard implementation in acupuncture training and practice. The findings also contribute to advancing informatics applications in traditional and complementary medicine, showcasing the potential of LLMs in natural language processing.
- [900] arXiv:2404.05417 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Indexing Analytics to Instances: How Integrating a Dashboard can Support Design EducationAjit Jain , Andruid Kerne , Nic Lupfer , Gabriel Britain , Aaron Perrine , Yoonsuck Choe , John Keyser , Ruihong Huang , Jinsil Seo , Annie Sungkajun , Robert Lightfoot , Timothy McGuireComments: 22 pages, 4 figures, Submitted to ACM DISSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: We investigate how to use AI-based analytics to support design education. The analytics at hand measure multiscale design, that is, students' use of space and scale to visually and conceptually organize their design work. With the goal of making the analytics intelligible to instructors, we developed a research artifact integrating a design analytics dashboard with design instances, and the design environment that students use to create them. We theorize about how Suchman's notion of mutual intelligibility requires contextualized investigation of AI in order to develop findings about how analytics work for people. We studied the research artifact in 5 situated course contexts, in 3 departments. A total of 236 students used the multiscale design environment. The 9 instructors who taught those students experienced the analytics via the new research artifact.
We derive findings from a qualitative analysis of interviews with instructors regarding their experiences. Instructors reflected on how the analytics and their presentation in the dashboard have the potential to affect design education. We develop research implications addressing: (1) how indexing design analytics in the dashboard to actual design work instances helps design instructors reflect on what they mean and, more broadly, is a technique for how AI-based design analytics can support instructors' assessment and feedback experiences in situated course contexts; and (2) how multiscale design analytics, in particular, have the potential to support design education. By indexing, we mean linking which provides context, here connecting the numbers of the analytics with visually annotated design work instances. - [901] arXiv:2404.05423 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Residual Chain Prediction for Autonomous Driving Path PlanningComments: 6 pages, 2 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: In the rapidly evolving field of autonomous driving systems, the refinement of path planning algorithms is paramount for navigating vehicles through dynamic environments, particularly in complex urban scenarios. Traditional path planning algorithms, which are heavily reliant on static rules and manually defined parameters, often fall short in such contexts, highlighting the need for more adaptive, learning-based approaches. Among these, behavior cloning emerges as a noteworthy strategy for its simplicity and efficiency, especially within the realm of end-to-end path planning. However, behavior cloning faces challenges, such as covariate shift when employing traditional Manhattan distance as the metric. Addressing this, our study introduces the novel concept of Residual Chain Loss. Residual Chain Loss dynamically adjusts the loss calculation process to enhance the temporal dependency and accuracy of predicted path points, significantly improving the model's performance without additional computational overhead. Through testing on the nuScenes dataset, we underscore the method's substantial advancements in addressing covariate shift, facilitating dynamic loss adjustments, and ensuring seamless integration with end-to-end path planning frameworks. Our findings highlight the potential of Residual Chain Loss to revolutionize planning component of autonomous driving systems, marking a significant step forward in the quest for level 5 autonomous driving system.
- [902] arXiv:2404.05427 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: AutoCodeRover: Autonomous Program ImprovementSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Researchers have made significant progress in automating the software development process in the past decades. Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use LLM-based programming assistants to achieve automated coding. Nevertheless software engineering involves the process of program improvement apart from coding, specifically to enable software maintenance (e.g. bug fixing) and software evolution (e.g. feature additions). In this paper, we propose an automated approach for solving GitHub issues to autonomously achieve program improvement. In our approach called AutoCodeRover, LLMs are combined with sophisticated code search capabilities, ultimately leading to a program modification or patch. In contrast to recent LLM agent approaches from AI researchers and practitioners, our outlook is more software engineering oriented. We work on a program representation (abstract syntax tree) as opposed to viewing a software project as a mere collection of files. Our code search exploits the program structure in the form of classes/methods to enhance LLM's understanding of the issue's root cause, and effectively retrieve a context via iterative search. The use of spectrum based fault localization using tests, further sharpens the context, as long as a test-suite is available. Experiments on SWE-bench-lite which consists of 300 real-life GitHub issues show increased efficacy in solving GitHub issues (22-23% on SWE-bench-lite). On the full SWE-bench consisting of 2294 GitHub issues, AutoCodeRover solved around 16% of issues, which is higher than the efficacy of the recently reported AI software engineer Devin from Cognition Labs, while taking time comparable to Devin. We posit that our workflow enables autonomous software engineering, where, in future, auto-generated code from LLMs can be autonomously improved.
- [903] arXiv:2404.05458 (cross-list from cs.LO) [ pdf , ps , html , other ]
-
Title: Teaching Higher-Order Logic Using IsabelleSimon Tobias Lund (Technical University of Denmark), Jørgen Villadsen (Technical University of Denmark)Comments: In Proceedings ThEdu'23, arXiv:2404.03709Journal-ref: EPTCS 400, 2024, pp. 59-78Subjects: Logic in Computer Science (cs.LO) ; Artificial Intelligence (cs.AI)
Abstract: We present a formalization of higher-order logic in the Isabelle proof assistant, building directly on the foundational framework Isabelle/Pure and developed to be as small and readable as possible. It should therefore serve as a good introduction for someone looking into learning about higher-order logic and proof assistants, without having to study the much more complex Isabelle/HOL with heavier automation. To showcase our development and approach we explain a sample proof, describe the axioms and rules of our higher-order logic, and discuss our experience with teaching the subject in a classroom setting.
- [904] arXiv:2404.05483 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: PetKaz at SemEval-2024 Task 8: Can Linguistics Capture the Specifics of LLM-generated Text?Comments: 8 pages, 3 figures, 5 tables, to be published in the Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), for associated code, see this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we present our submission to the SemEval-2024 Task 8 "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection", focusing on the detection of machine-generated texts (MGTs) in English. Specifically, our approach relies on combining embeddings from the RoBERTa-base with diversity features and uses a resampled training set. We score 12th from 124 in the ranking for Subtask A (monolingual track), and our results show that our approach is generalizable across unseen models and domains, achieving an accuracy of 0.91.
- [905] arXiv:2404.05499 (cross-list from cs.SE) [ pdf , ps , other ]
-
Title: Guiding Large Language Models to Generate Computer-Parsable ContentComments: 44 pages, 39 figures, 8 tables, Chinese version: this https URLSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: We propose a method to guide Large Language Models (LLMs) in generating structured content adhering to specific conventions without fine-tuning. By utilizing coroutine-based content generation constraints through a pre-agreed context-free grammar (CFG), LLMs are directed during decoding to produce formal language compliant outputs. This enhances stability and consistency in generating target data structures, types, or instructions, reducing application development complexities. Experimentally, error rates of GPT-2 and Gemma exceed 95% for DSLs longer than 36 and 282 tokens, respectively. We introduce YieldLang, a coroutine-based DSL generation framework, and evaluate it with LLMs on various tasks including JSON and Mermaid flowchart generation. Compared to benchmarks, our approach improves accuracy by 1.09 to 11.6 times, with LLMs requiring only about 16.5% of the samples to generate JSON effectively. This enhances usability of LLM-generated content for computer programs.
- [906] arXiv:2404.05501 (cross-list from q-bio.NC) [ pdf , ps , other ]
-
Title: Data Science In OlfactionComments: 20 pages, 10 Figures, 2 Appendix, 1 TableSubjects: Neurons and Cognition (q-bio.NC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Advances in neural sensing technology are making it possible to observe the olfactory process in great detail. In this paper, we conceptualize smell from a Data Science and AI perspective, that relates the properties of odorants to how they are sensed and analyzed in the olfactory system from the nose to the brain. Drawing distinctions to color vision, we argue that smell presents unique measurement challenges, including the complexity of stimuli, the high dimensionality of the sensory apparatus, as well as what constitutes ground truth. In the face of these challenges, we argue for the centrality of odorant-receptor interactions in developing a theory of olfaction. Such a theory is likely to find widespread industrial applications, and enhance our understanding of smell, and in the longer-term, how it relates to other senses and language. As an initial use case of the data, we present results using machine learning-based classification of neural responses to odors as they are recorded in the mouse olfactory bulb with calcium imaging.
- [907] arXiv:2404.05502 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: PetKaz at SemEval-2024 Task 3: Advancing Emotion Classification with an LLM for Emotion-Cause Pair Extraction in ConversationsComments: 8 pages, 7 figures, 2 tables, to be published in the Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), for associated code, see this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we present our submission to the SemEval-2023 Task~3 "The Competition of Multimodal Emotion Cause Analysis in Conversations", focusing on extracting emotion-cause pairs from dialogs. Specifically, our approach relies on combining fine-tuned GPT-3.5 for emotion classification and a BiLSTM-based neural network to detect causes. We score 2nd in the ranking for Subtask 1, demonstrating the effectiveness of our approach through one of the highest weighted-average proportional F1 scores recorded at 0.264.
- [908] arXiv:2404.05508 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Synergy of Large Language Model and Model Driven Engineering for Automated Development of Centralized Vehicular SystemsNenad Petrovic , Fengjunjie Pan , Krzysztof Lebioda , Vahid Zolfaghari , Sven Kirchner , Nils Purschke , Muhammad Aqib Khan , Viktor Vorobev , Alois KnollSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: We present a prototype of a tool leveraging the synergy of model driven engineering (MDE) and Large Language Models (LLM) for the purpose of software development process automation in the automotive industry. In this approach, the user-provided input is free form textual requirements, which are first translated to Ecore model instance representation using an LLM, which is afterwards checked for consistency using Object Constraint Language (OCL) rules. After successful consistency check, the model instance is fed as input to another LLM for the purpose of code generation. The generated code is evaluated in a simulated environment using CARLA simulator connected to an example centralized vehicle architecture, in an emergency brake scenario.
- [909] arXiv:2404.05530 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference DataSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a popular method for aligning Language Models (LM) with human values and preferences. RLHF requires a large number of preference pairs as training data, which are often used in both the Supervised Fine-Tuning and Reward Model training, and therefore publicly available datasets are commonly used. In this work, we study to what extent a malicious actor can manipulate the LMs generations by poisoning the preferences, i.e., injecting poisonous preference pairs into these datasets and the RLHF training process. We propose strategies to build poisonous preference pairs and test their performance by poisoning two widely used preference datasets. Our results show that preference poisoning is highly effective: by injecting a small amount of poisonous data (1-5% of the original dataset), we can effectively manipulate the LM to generate a target entity in a target sentiment (positive or negative). The findings from our experiments also shed light on strategies to defend against the preference poisoning attack.
- [910] arXiv:2404.05534 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Ordre public exceptions for algorithmic surveillance patentsComments: 14 pagesSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: This chapter explores the role of patent protection in algorithmic surveillance and whether ordre public exceptions from patentability should apply to such patents, due to their potential to enable human rights violations. It concludes that in most cases, it is undesirable to exclude algorithmic surveillance patents from patentability, as the patent system is ill-equipped to evaluate the impacts of the exploitation of such technologies. Furthermore, the disclosure of such patents has positive externalities from the societal perspective by opening the black box of surveillance for public scrutiny.
- [911] arXiv:2404.05540 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: OPSD: an Offensive Persian Social media Dataset and its baseline evaluationsMehran Safayani , Amir Sartipi , Amir Hossein Ahmadi , Parniyan Jalali , Amir Hossein Mansouri , Mohammad Bisheh-Niasar , Zahra PourbahmanComments: 16 pages, 5 figures, 8 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The proliferation of hate speech and offensive comments on social media has become increasingly prevalent due to user activities. Such comments can have detrimental effects on individuals' psychological well-being and social behavior. While numerous datasets in the English language exist in this domain, few equivalent resources are available for Persian language. To address this gap, this paper introduces two offensive datasets. The first dataset comprises annotations provided by domain experts, while the second consists of a large collection of unlabeled data obtained through web crawling for unsupervised learning purposes. To ensure the quality of the former dataset, a meticulous three-stage labeling process was conducted, and kappa measures were computed to assess inter-annotator agreement. Furthermore, experiments were performed on the dataset using state-of-the-art language models, both with and without employing masked language modeling techniques, as well as machine learning algorithms, in order to establish the baselines for the dataset using contemporary cutting-edge approaches. The obtained F1-scores for the three-class and two-class versions of the dataset were 76.9% and 89.9% for XLM-RoBERTa, respectively.
- [912] arXiv:2404.05545 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Evaluating Interventional Reasoning Capabilities of Large Language ModelsComments: 17 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Methodology (stat.ME)
Abstract: Numerous decision-making tasks require estimating causal effects under interventions on different parts of a system. As practitioners consider using large language models (LLMs) to automate decisions, studying their causal reasoning capabilities becomes crucial. A recent line of work evaluates LLMs ability to retrieve commonsense causal facts, but these evaluations do not sufficiently assess how LLMs reason about interventions. Motivated by the role that interventions play in causal inference, in this paper, we conduct empirical analyses to evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types, and enable a study of intervention-based reasoning. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from their ability to memorize facts or find other shortcuts. Our analysis on four LLMs highlights that while GPT- 4 models show promising accuracy at predicting the intervention effects, they remain sensitive to distracting factors in the prompts.
- [913] arXiv:2404.05553 (cross-list from q-bio.NC) [ pdf , ps , other ]
-
Title: Alljoined1 -- A dataset for EEG-to-Image decodingJonathan Xu , Bruno Aristimunha , Max Emanuel Feucht , Emma Qian , Charles Liu , Tazik Shahjahan , Martyna Spyra , Steven Zifan Zhang , Nicholas Short , Jioh Kim , Paula Perdomo , Ricky Renfeng Mao , Yashvir Sabharwal , Michael Ahedor Moaz Shoura , Adrian NestorComments: 8 Pages, 6 FiguresSubjects: Neurons and Cognition (q-bio.NC) ; Artificial Intelligence (cs.AI)
Abstract: We present Alljoined1, a dataset built specifically for EEG-to-Image decoding. Recognizing that an extensive and unbiased sampling of neural responses to visual stimuli is crucial for image reconstruction efforts, we collected data from 8 participants looking at 10,000 natural images each. We have currently gathered 46,080 epochs of brain responses recorded with a 64-channel EEG headset. The dataset combines response-based stimulus timing, repetition between blocks and sessions, and diverse image classes with the goal of improving signal quality. For transparency, we also provide data quality scores. We publicly release the dataset and all code at this https URL .
- [914] arXiv:2404.05555 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: On the Convergence of Continual Learning with Adaptive MethodsComments: Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI 2023), see https://proceedings.mlr.press/v216/han23a.htmlJournal-ref: PMLR 216:809-818, 2023Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: One of the objectives of continual learning is to prevent catastrophic forgetting in learning multiple tasks sequentially, and the existing solutions have been driven by the conceptualization of the plasticity-stability dilemma. However, the convergence of continual learning for each sequential task is less studied so far. In this paper, we provide a convergence analysis of memory-based continual learning with stochastic gradient descent and empirical evidence that training current tasks causes the cumulative degradation of previous tasks. We propose an adaptive method for nonconvex continual learning (NCCL), which adjusts step sizes of both previous and current tasks with the gradients. The proposed method can achieve the same convergence rate as the SGD method when the catastrophic forgetting term which we define in the paper is suppressed at each iteration. Further, we demonstrate that the proposed algorithm improves the performance of continual learning over existing methods for several image classification tasks.
- [915] arXiv:2404.05567 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language ModelsBowen Pan , Yikang Shen , Haokun Liu , Mayank Mishra , Gaoyuan Zhang , Aude Oliva , Colin Raffel , Rameswar PandaSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally require 2-4$\times$ times more parameters to achieve comparable performance to a dense model, which incurs larger GPU memory requirements and makes MoE models less efficient in I/O-bounded scenarios like autoregressive generation. In this work, we propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency by employing dense computation across all experts during training and sparse computation during inference. Our experiments on training LLMs demonstrate that our DS-MoE models are more parameter-efficient than standard sparse MoEs and are on par with dense models in terms of total parameter size and performance while being computationally cheaper (activating 30-40% of the model's parameters). Performance tests using vLLM show that our DS-MoE-6B model runs up to $1.86\times$ faster than similar dense models like Mistral-7B, and between $1.50\times$ and $1.71\times$ faster than comparable MoEs, such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B.
- [916] arXiv:2404.05603 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Self-Explainable Affordance Learning with Embodied CaptionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In the field of visual affordance learning, previous methods mainly used abundant images or videos that delineate human behavior patterns to identify action possibility regions for object manipulation, with a variety of applications in robotic tasks. However, they encounter a main challenge of action ambiguity, illustrated by the vagueness like whether to beat or carry a drum, and the complexities involved in processing intricate scenes. Moreover, it is important for human intervention to rectify robot errors in time. To address these issues, we introduce Self-Explainable Affordance learning (SEA) with embodied caption. This innovation enables robots to articulate their intentions and bridge the gap between explainable vision-language caption and visual affordance learning. Due to a lack of appropriate dataset, we unveil a pioneering dataset and metrics tailored for this task, which integrates images, heatmaps, and embodied captions. Furthermore, we propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner. Extensive quantitative and qualitative experiments demonstrate our method's effectiveness.
- [917] arXiv:2404.05605 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Graph Neural Networks Automated Design and Deployment on Device-Edge Co-Inference SystemsComments: Accepted by DAC'24Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The key to device-edge co-inference paradigm is to partition models into computation-friendly and computation-intensive parts across the device and the edge, respectively. However, for Graph Neural Networks (GNNs), we find that simply partitioning without altering their structures can hardly achieve the full potential of the co-inference paradigm due to various computational-communication overheads of GNN operations over heterogeneous devices. We present GCoDE, the first automatic framework for GNN that innovatively Co-designs the architecture search and the mapping of each operation on Device-Edge hierarchies. GCoDE abstracts the device communication process into an explicit operation and fuses the search of architecture and the operations mapping in a unified space for joint-optimization. Also, the performance-awareness approach, utilized in the constraint-based search process of GCoDE, enables effective evaluation of architecture efficiency in diverse heterogeneous systems. We implement the co-inference engine and runtime dispatcher in GCoDE to enhance the deployment efficiency. Experimental results show that GCoDE can achieve up to $44.9\times$ speedup and $98.2\%$ energy reduction compared to existing approaches across various applications and system configurations.
- [918] arXiv:2404.05613 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Deep Representation Learning for Multi-functional Degradation Modeling of Community-dwelling Aging PopulationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: As the aging population grows, particularly for the baby boomer generation, the United States is witnessing a significant increase in the elderly population experiencing multifunctional disabilities. These disabilities, stemming from a variety of chronic diseases, injuries, and impairments, present a complex challenge due to their multidimensional nature, encompassing both physical and cognitive aspects. Traditional methods often use univariate regression-based methods to model and predict single degradation conditions and assume population homogeneity, which is inadequate to address the complexity and diversity of aging-related degradation. This study introduces a novel framework for multi-functional degradation modeling that captures the multidimensional (e.g., physical and cognitive) and heterogeneous nature of elderly disabilities. Utilizing deep learning, our approach predicts health degradation scores and uncovers latent heterogeneity from elderly health histories, offering both efficient estimation and explainable insights into the diverse effects and causes of aging-related degradation. A real-case study demonstrates the effectiveness and marks a pivotal contribution to accurately modeling the intricate dynamics of elderly degradation, and addresses the healthcare challenges in the aging population.
- [919] arXiv:2404.05624 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: LTNER: Large Language Model Tagging for Named Entity Recognition with Contextualized Entity MarkingComments: 13 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The use of LLMs for natural language processing has become a popular trend in the past two years, driven by their formidable capacity for context comprehension and learning, which has inspired a wave of research from academics and industry professionals. However, for certain NLP tasks, such as NER, the performance of LLMs still falls short when compared to supervised learning methods. In our research, we developed a NER processing framework called LTNER that incorporates a revolutionary Contextualized Entity Marking Gen Method. By leveraging the cost-effective GPT-3.5 coupled with context learning that does not require additional training, we significantly improved the accuracy of LLMs in handling NER tasks. The F1 score on the CoNLL03 dataset increased from the initial 85.9% to 91.9%, approaching the performance of supervised fine-tuning. This outcome has led to a deeper understanding of the potential of LLMs.
- [920] arXiv:2404.05639 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Investigating the Impact of Quantization on Adversarial RobustnessComments: Accepted to ICLR 2024 Workshop PML4LRSSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: Quantization is a promising technique for reducing the bit-width of deep models to improve their runtime performance and storage efficiency, and thus becomes a fundamental step for deployment. In real-world scenarios, quantized models are often faced with adversarial attacks which cause the model to make incorrect inferences by introducing slight perturbations. However, recent studies have paid less attention to the impact of quantization on the model robustness. More surprisingly, existing studies on this topic even present inconsistent conclusions, which prompted our in-depth investigation. In this paper, we conduct a first-time analysis of the impact of the quantization pipeline components that can incorporate robust optimization under the settings of Post-Training Quantization and Quantization-Aware Training. Through our detailed analysis, we discovered that this inconsistency arises from the use of different pipelines in different studies, specifically regarding whether robust optimization is performed and at which quantization stage it occurs. Our research findings contribute insights into deploying more secure and robust quantized networks, assisting practitioners in reference for scenarios with high-security requirements and limited resources.
- [921] arXiv:2404.05648 (cross-list from cs.AR) [ pdf , ps , html , other ]
-
Title: Resistive Memory-based Neural Differential Equation Solver for Score-based Diffusion ModelJichang Yang , Hegan Chen , Jia Chen , Songqi Wang , Shaocong Wang , Yifei Yu , Xi Chen , Bo Wang , Xinyuan Zhang , Binbin Cui , Yi Li , Ning Lin , Meng Xu , Yi Li , Xiaoxin Xu , Xiaojuan Qi , Zhongrui Wang , Xumeng Zhang , Dashan Shang , Han Wang , Qi Liu , Kwang-Ting Cheng , Ming LiuSubjects: Hardware Architecture (cs.AR) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
Abstract: Human brains image complicated scenes when reading a novel. Replicating this imagination is one of the ultimate goals of AI-Generated Content (AIGC). However, current AIGC methods, such as score-based diffusion, are still deficient in terms of rapidity and efficiency. This deficiency is rooted in the difference between the brain and digital computers. Digital computers have physically separated storage and processing units, resulting in frequent data transfers during iterative calculations, incurring large time and energy overheads. This issue is further intensified by the conversion of inherently continuous and analog generation dynamics, which can be formulated by neural differential equations, into discrete and digital operations. Inspired by the brain, we propose a time-continuous and analog in-memory neural differential equation solver for score-based diffusion, employing emerging resistive memory. The integration of storage and computation within resistive memory synapses surmount the von Neumann bottleneck, benefiting the generative speed and energy efficiency. The closed-loop feedback integrator is time-continuous, analog, and compact, physically implementing an infinite-depth neural network. Moreover, the software-hardware co-design is intrinsically robust to analog noise. We experimentally validate our solution with 180 nm resistive memory in-memory computing macros. Demonstrating equivalent generative quality to the software baseline, our system achieved remarkable enhancements in generative speed for both unconditional and conditional generation tasks, by factors of 64.8 and 156.5, respectively. Moreover, it accomplished reductions in energy consumption by factors of 5.2 and 4.1. Our approach heralds a new horizon for hardware solutions in edge computing for generative AI applications.
- [922] arXiv:2404.05656 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Causality Extraction from Nuclear Licensee Event Reports Using a Hybrid FrameworkSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Industry-wide nuclear power plant operating experience is a critical source of raw data for performing parameter estimations in reliability and risk models. Much operating experience information pertains to failure events and is stored as reports containing unstructured data, such as narratives. Event reports are essential for understanding how failures are initiated and propagated, including the numerous causal relations involved. Causal relation extraction using deep learning represents a significant frontier in the field of natural language processing (NLP), and is crucial since it enables the interpretation of intricate narratives and connections contained within vast amounts of written information. This paper proposed a hybrid framework for causality detection and extraction from nuclear licensee event reports. The main contributions include: (1) we compiled an LER corpus with 20,129 text samples for causality analysis, (2) developed an interactive tool for labeling cause effect pairs, (3) built a deep-learning-based approach for causal relation detection, and (4) developed a knowledge based cause-effect extraction approach.
- [923] arXiv:2404.05659 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical DomainComments: LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Due to privacy restrictions, there's a shortage of publicly available speech recognition datasets in the medical domain. In this work, we present VietMed - a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. To our best knowledge, VietMed is by far the world's largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country. Moreover, we release the first public large-scale pre-trained models for Vietnamese ASR, w2v2-Viet and XLSR-53-Viet, along with the first public large-scale fine-tuned models for medical ASR. Even without any medical data in unsupervised pre-training, our best pre-trained model XLSR-53-Viet generalizes very well to the medical domain by outperforming state-of-the-art XLSR-53, from 51.8% to 29.6% WER on test set (a relative reduction of more than 40%). All code, data and models are made publicly available here: this https URL .
- [924] arXiv:2404.05688 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: David and Goliath: An Empirical Evaluation of Attacks and Defenses for QNNs at the Deep EdgeSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: ML is shifting from the cloud to the edge. Edge computing reduces the surface exposing private data and enables reliable throughput guarantees in real-time applications. Of the panoply of devices deployed at the edge, resource-constrained MCUs, e.g., Arm Cortex-M, are more prevalent, orders of magnitude cheaper, and less power-hungry than application processors or GPUs. Thus, enabling intelligence at the deep edge is the zeitgeist, with researchers focusing on unveiling novel approaches to deploy ANNs on these constrained devices. Quantization is a well-established technique that has proved effective in enabling the deployment of neural networks on MCUs; however, it is still an open question to understand the robustness of QNNs in the face of adversarial examples.
To fill this gap, we empirically evaluate the effectiveness of attacks and defenses from (full-precision) ANNs on (constrained) QNNs. Our evaluation includes three QNNs targeting TinyML applications, ten attacks, and six defenses. With this study, we draw a set of interesting findings. First, quantization increases the point distance to the decision boundary and leads the gradient estimated by some attacks to explode or vanish. Second, quantization can act as a noise attenuator or amplifier, depending on the noise magnitude, and causes gradient misalignment. Regarding adversarial defenses, we conclude that input pre-processing defenses show impressive results on small perturbations; however, they fall short as the perturbation increases. At the same time, train-based defenses increase the average point distance to the decision boundary, which holds after quantization. However, we argue that train-based defenses still need to smooth the quantization-shift and gradient misalignment phenomenons to counteract adversarial example transferability to QNNs. All artifacts are open-sourced to enable independent validation of results. - [925] arXiv:2404.05689 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Automated discovery of symbolic laws governing skill acquisition from naturally occurring dataSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Skill acquisition is a key area of research in cognitive psychology as it encompasses multiple psychological processes. The laws discovered under experimental paradigms are controversial and lack generalizability. This paper aims to unearth the laws of skill learning from large-scale training log data. A two-stage algorithm was developed to tackle the issues of unobservable cognitive states and algorithmic explosion in searching. Initially a deep learning model is employed to determine the learner's cognitive state and assess the feature importance. Subsequently, symbolic regression algorithms are utilized to parse the neural network model into algebraic equations. The experimental results of simulated data demonstrate that the proposed algorithm can accurately restore various preset laws within a certain range of noise, in continues feedback setting. Application of proposed method to Lumosity training data demonstrates superior performance compared to traditional and latest models in terms of fitness. The results indicate the discovery of two new forms of skill acquisition laws, while some previous findings have been reaffirmed.
- [926] arXiv:2404.05694 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Comprehensive Study on German Language Models for Clinical and Biomedical Text UnderstandingAhmad Idrissi-Yaghir , Amin Dada , Henning Schäfer , Kamyar Arzideh , Giulia Baldini , Jan Trienes , Max Hasin , Jeanette Bewersdorff , Cynthia S. Schmidt , Marie Bauer , Kaleb E. Smith , Jiang Bian , Yonghui Wu , Jörg Schlötterer , Torsten Zesch , Peter A. Horn , Christin Seifert , Felix Nensa , Jens Kleesiek , Christoph M. FriedrichComments: Accepted at LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent advances in natural language processing (NLP) can be largely attributed to the advent of pre-trained language models such as BERT and RoBERTa. While these models demonstrate remarkable performance on general datasets, they can struggle in specialized domains such as medicine, where unique domain-specific terminologies, domain-specific abbreviations, and varying document structures are common. This paper explores strategies for adapting these models to domain-specific requirements, primarily through continuous pre-training on domain-specific data. We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data. The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering. Our results suggest that models augmented by clinical and translation-based pre-training typically outperform general domain models in medical contexts. We conclude that continuous pre-training has demonstrated the ability to match or even exceed the performance of clinical models trained from scratch. Furthermore, pre-training on clinical data or leveraging translated texts have proven to be reliable methods for domain adaptation in medical NLP tasks.
- [927] arXiv:2404.05695 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Humanoid-Gym: Reinforcement Learning for Humanoid Robot with Zero-Shot Sim2Real TransferSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract: Humanoid-Gym is an easy-to-use reinforcement learning (RL) framework based on Nvidia Isaac Gym, designed to train locomotion skills for humanoid robots, emphasizing zero-shot transfer from simulation to the real-world environment. Humanoid-Gym also integrates a sim-to-sim framework from Isaac Gym to Mujoco that allows users to verify the trained policies in different physical simulations to ensure the robustness and generalization of the policies. This framework is verified by RobotEra's XBot-S (1.2-meter tall humanoid robot) and XBot-L (1.65-meter tall humanoid robot) in a real-world environment with zero-shot sim-to-real transfer. The project website and source code can be found at: this https URL .
- [928] arXiv:2404.05717 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual EditingJing Gu , Yilin Wang , Nanxuan Zhao , Wei Xiong , Qing Liu , Zhifei Zhang , He Zhang , Jianming Zhang , HyunJoon Jung , Xin Eric WangComments: 18 pages, 16 figures, 3 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Effective editing of personal content holds a pivotal role in enabling individuals to express their creativity, weaving captivating narratives within their visual stories, and elevate the overall quality and impact of their visual content. Therefore, in this work, we introduce SwapAnything, a novel framework that can swap any objects in an image with personalized concepts given by the reference, while keeping the context unchanged. Compared with existing methods for personalized subject swapping, SwapAnything has three unique advantages: (1) precise control of arbitrary objects and parts rather than the main subject, (2) more faithful preservation of context pixels, (3) better adaptation of the personalized concept to the image. First, we propose targeted variable swapping to apply region control over latent feature maps and swap masked variables for faithful context preservation and initial semantic concept swapping. Then, we introduce appearance adaptation, to seamlessly adapt the semantic concept into the original image in terms of target location, shape, style, and content during the image generation process. Extensive results on both human and automatic evaluation demonstrate significant improvements of our approach over baseline methods on personalized swapping. Furthermore, SwapAnything shows its precise and faithful swapping abilities across single object, multiple objects, partial object, and cross-domain swapping tasks. SwapAnything also achieves great performance on text-based swapping and tasks beyond swapping such as object insertion.
- [929] arXiv:2404.05720 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Language-Independent Representations Improve Zero-Shot SummarizationComments: NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Finetuning pretrained models on downstream generation tasks often leads to catastrophic forgetting in zero-shot conditions. In this work, we focus on summarization and tackle the problem through the lens of language-independent representations. After training on monolingual summarization, we perform zero-shot transfer to new languages or language pairs. We first show naively finetuned models are highly language-specific in both output behavior and internal representations, resulting in poor zero-shot performance. Next, we propose query-key (QK) finetuning to decouple task-specific knowledge from the pretrained language generation abilities. Then, after showing downsides of the standard adversarial language classifier, we propose a balanced variant that more directly enforces language-agnostic representations. Moreover, our qualitative analyses show removing source language identity correlates to zero-shot summarization performance. Our code is openly available.
- [930] arXiv:2404.05741 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural InnovationsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Performance (cs.PF)
Abstract: Large Language Models are growing in size, and we expect them to continue to do so, as larger models train quicker. However, this increase in size will severely impact inference costs. Therefore model compression is important, to retain the performance of larger models, but with a reduced cost of running them. In this thesis we explore the methods of model compression, and we empirically demonstrate that the simple method of skipping latter attention sublayers in Transformer LLMs is an effective method of model compression, as these layers prove to be redundant, whilst also being incredibly computationally expensive. We observed a 21% speed increase in one-token generation for Llama 2 7B, whilst surprisingly and unexpectedly improving performance over several common benchmarks.
- [931] arXiv:2404.05746 (cross-list from physics.data-an) [ pdf , ps , html , other ]
-
Title: Causality for Earth Science -- A Review on Time-series and Spatiotemporal Causality MethodsSahara Ali , Uzma Hasan , Xingyan Li , Omar Faruque , Akila Sampath , Yiyi Huang , Md Osman Gani , Jianwu WangSubjects: Data Analysis, Statistics and Probability (physics.data-an) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Geophysics (physics.geo-ph)
Abstract: This survey paper covers the breadth and depth of time-series and spatiotemporal causality methods, and their applications in Earth Science. More specifically, the paper presents an overview of causal discovery and causal inference, explains the underlying causal assumptions, and enlists evaluation techniques and key terminologies of the domain area. The paper elicits the various state-of-the-art methods introduced for time-series and spatiotemporal causal analysis along with their strengths and limitations. The paper further describes the existing applications of several methods for answering specific Earth Science questions such as extreme weather events, sea level rise, teleconnections etc. This survey paper can serve as a primer for Data Science researchers interested in data-driven causal study as we share a list of resources, such as Earth Science datasets (synthetic, simulated and observational data) and open source tools for causal analysis. It will equally benefit the Earth Science community interested in taking an AI-driven approach to study the causality of different dynamic and thermodynamic processes as we present the open challenges and opportunities in performing causality-based Earth Science study.
- [932] arXiv:2404.05758 (cross-list from physics.data-an) [ pdf , ps , other ]
-
Title: Implicit Assimilation of Sparse In Situ Data for Dense & Global Storm Surge ForecastingComments: Accepted at CVPR EarthVision 2024Subjects: Data Analysis, Statistics and Probability (physics.data-an) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Applications (stat.AP)
Abstract: Hurricanes and coastal floods are among the most disastrous natural hazards. Both are intimately related to storm surges, as their causes and effects, respectively. However, the short-term forecasting of storm surges has proven challenging, especially when targeting previously unseen locations or sites without tidal gauges. Furthermore, recent work improved short and medium-term weather forecasting but the handling of raw unassimilated data remains non-trivial. In this paper, we tackle both challenges and demonstrate that neural networks can implicitly assimilate sparse in situ tide gauge data with coarse ocean state reanalysis in order to forecast storm surges. We curate a global dataset to learn and validate the dense prediction of storm surges, building on preceding efforts. Other than prior work limited to known gauges, our approach extends to ungauged sites, paving the way for global storm surge forecasting.
- [933] arXiv:2404.05762 (cross-list from q-bio.QM) [ pdf , ps , other ]
-
Title: Evaluating the Effectiveness of Artificial Intelligence in Predicting Adverse Drug Reactions among Cancer Patients: A Systematic Review and Meta-AnalysisComments: Paper has been accepted at the IEEE Challenges and Innovations on TIC (IEEE I2CIT) International ConferenceSubjects: Quantitative Methods (q-bio.QM) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Adverse drug reactions considerably impact patient outcomes and healthcare costs in cancer therapy. Using artificial intelligence to predict adverse drug reactions in real time could revolutionize oncology treatment. This study aims to assess the performance of artificial intelligence models in predicting adverse drug reactions in patients with cancer. This is the first systematic review and meta-analysis. Scopus, PubMed, IEEE Xplore, and ACM Digital Library databases were searched for studies in English, French, and Arabic from January 1, 2018, to August 20, 2023. The inclusion criteria were: (1) peer-reviewed research articles; (2) use of artificial intelligence algorithms (machine learning, deep learning, knowledge graphs); (3) study aimed to predict adverse drug reactions (cardiotoxicity, neutropenia, nephrotoxicity, hepatotoxicity); (4) study was on cancer patients. The data were extracted and evaluated by three reviewers for study quality. Of the 332 screened articles, 17 studies (5%) involving 93,248 oncology patients from 17 countries were included in the systematic review, of which ten studies synthesized the meta-analysis. A random-effects model was created to pool the sensitivity, specificity, and AUC of the included studies. The pooled results were 0.82 (95% CI:0.69, 0.9), 0.84 (95% CI:0.75, 0.9), and 0.83 (95% CI:0.77, 0.87) for sensitivity, specificity, and AUC, respectively, of ADR predictive models. Biomarkers proved their effectiveness in predicting ADRs, yet they were adopted by only half of the reviewed studies. The use of AI in cancer treatment shows great potential, with models demonstrating high specificity and sensitivity in predicting ADRs. However, standardized research and multicenter studies are needed to improve the quality of evidence. AI can enhance cancer patient care by bridging the gap between data-driven insights and clinical expertise.
- [934] arXiv:2404.05765 (cross-list from cs.SD) [ pdf , ps , other ]
-
Title: A Novel Bi-LSTM And Transformer Architecture For Generating Tabla MusicSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Introduction: Music generation is a complex task that has received significant attention in recent years, and deep learning techniques have shown promising results in this field. Objectives: While extensive work has been carried out on generating Piano and other Western music, there is limited research on generating classical Indian music due to the scarcity of Indian music in machine-encoded formats. In this technical paper, methods for generating classical Indian music, specifically tabla music, is proposed. Initially, this paper explores piano music generation using deep learning architectures. Then the fundamentals are extended to generating tabla music. Methods: Tabla music in waveform (.wav) files are pre-processed using the librosa library in Python. A novel Bi-LSTM with an Attention approach and a transformer model are trained on the extracted features and labels. Results: The models are then used to predict the next sequences of tabla music. A loss of 4.042 and MAE of 1.0814 are achieved with the Bi-LSTM model. With the transformer model, a loss of 55.9278 and MAE of 3.5173 are obtained for tabla music generation. Conclusion: The resulting music embodies a harmonious fusion of novelty and familiarity, pushing the limits of music composition to new horizons.
- [935] arXiv:2404.05767 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: CSA-Trans: Code Structure Aware Transformer for ASTSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: When applying the Transformer architecture to source code, designing a good self-attention mechanism is critical as it affects how node relationship is extracted from the Abstract Syntax Trees (ASTs) of the source code. We present Code Structure Aware Transformer (CSA-Trans), which uses Code Structure Embedder (CSE) to generate specific PE for each node in AST. CSE generates node Positional Encoding (PE) using disentangled attention. To further extend the self-attention capability, we adopt Stochastic Block Model (SBM) attention. Our evaluation shows that our PE captures the relationships between AST nodes better than other graph-related PE techniques. We also show through quantitative and qualitative analysis that SBM attention is able to generate more node specific attention coefficients. We demonstrate that CSA-Trans outperforms 14 baselines in code summarization tasks for both Python and Java, while being 41.92% faster and 25.31% memory efficient in Java dataset compared to AST-Trans and SG-Trans respectively.
- [936] arXiv:2404.05769 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Dynamic Quality-Diversity SearchComments: Parts of this manuscripts are published at The Genetic and Evolutionary Computation Conference (GECCO) 2024Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Evolutionary search via the quality-diversity (QD) paradigm can discover highly performing solutions in different behavioural niches, showing considerable potential in complex real-world scenarios such as evolutionary robotics. Yet most QD methods only tackle static tasks that are fixed over time, which is rarely the case in the real world. Unlike noisy environments, where the fitness of an individual changes slightly at every evaluation, dynamic environments simulate tasks where external factors at unknown and irregular intervals alter the performance of the individual with a severity that is unknown a priori. Literature on optimisation in dynamic environments is extensive, yet such environments have not been explored in the context of QD search. This paper introduces a novel and generalisable Dynamic QD methodology that aims to keep the archive of past solutions updated in the case of environment changes. Secondly, we present a novel characterisation of dynamic environments that can be easily applied to well-known benchmarks, with minor interventions to move them from a static task to a dynamic one. Our Dynamic QD intervention is applied on MAP-Elites and CMA-ME, two powerful QD algorithms, and we test the dynamic variants on different dynamic tasks.
- [937] arXiv:2404.05774 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: STMGF: An Effective Spatial-Temporal Multi-Granularity Framework for Traffic ForecastingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Accurate Traffic Prediction is a challenging task in intelligent transportation due to the spatial-temporal aspects of road networks. The traffic of a road network can be affected by long-distance or long-term dependencies where existing methods fall short in modeling them. In this paper, we introduce a novel framework known as Spatial-Temporal Multi-Granularity Framework (STMGF) to enhance the capture of long-distance and long-term information of the road networks. STMGF makes full use of different granularity information of road networks and models the long-distance and long-term information by gathering information in a hierarchical interactive way. Further, it leverages the inherent periodicity in traffic sequences to refine prediction results by matching with recent traffic data. We conduct experiments on two real-world datasets, and the results demonstrate that STMGF outperforms all baseline models and achieves state-of-the-art performance.
- [938] arXiv:2404.05776 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Forecasting Electric Vehicle Battery Output Voltage: A Predictive Modeling ApproachSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: The battery management system plays a vital role in ensuring the safety and dependability of electric and hybrid vehicles. It is responsible for various functions, including state evaluation, monitoring, charge control, and cell balancing, all integrated within the BMS. Nonetheless, due to the uncertainties surrounding battery performance, implementing these functionalities poses significant challenges. In this study, we explore the latest approaches for assessing battery states, highlight notable advancements in battery management systems (BMS), address existing issues with current BMS technology, and put forth possible solutions for predicting battery charging voltage.
- [939] arXiv:2404.05777 (cross-list from cs.DB) [ pdf , ps , html , other ]
-
Title: IA2: Leveraging Instance-Aware Index Advisor with Reinforcement Learning for Diverse WorkloadsComments: EuroMLSys 24, April 22, 2024, Athens, GreeceSubjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI)
Abstract: This study introduces the Instance-Aware Index Advisor (IA2), a novel deep reinforcement learning (DRL)-based approach for optimizing index selection in databases facing large action spaces of potential candidates. IA2 introduces the Twin Delayed Deep Deterministic Policy Gradient - Temporal Difference State-Wise Action Refinery (TD3-TD-SWAR) model, enabling efficient index selection by understanding workload-index dependencies and employing adaptive action masking. This method includes a comprehensive workload model, enhancing its ability to adapt to unseen workloads and ensuring robust performance across diverse database environments. Evaluation on benchmarks such as TPC-H reveals IA2's suggested indexes' performance in enhancing runtime, securing a 40% reduction in runtime for complex TPC-H workloads compared to scenarios without indexes, and delivering a 20% improvement over existing state-of-the-art DRL-based index advisors.
- [940] arXiv:2404.05779 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Data Readiness for AI: A 360-Degree SurveyComments: 35 pages, 3 figures, 3 tables, submitted to ACM Computing SurveysSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Data are the critical fuel for Artificial Intelligence (AI) models. Poor quality data produces inaccurate and ineffective AI models that may lead to incorrect or unsafe use. Checking for data readiness is a crucial step in improving data quality. Numerous R&D efforts have been spent on improving data quality. However, standardized metrics for evaluating data readiness for use in AI training are still evolving. In this study, we perform a comprehensive survey of metrics used for verifying AI's data readiness. This survey examines more than 120 papers that are published by ACM Digital Library, IEEE Xplore, other reputable journals, and articles published on the web by prominent AI experts. This survey aims to propose a taxonomy of data readiness for AI (DRAI) metrics for structured and unstructured datasets. We anticipate that this taxonomy can lead to new standards for DRAI metrics that would be used for enhancing the quality and accuracy of AI training and inference.
- [941] arXiv:2404.05783 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Responsible Generative AI: What to Generate and What NotComments: 74 pages, 10 figuresSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: In recent years, generative AI (GenAI), like large language models and text-to-image models, has received significant attention across various domains. However, ensuring the responsible generation of content by these models is crucial for their real-world applicability. This raises an interesting question: \textit{What should responsible GenAI generate, and what should it not?} To answer the question, this paper investigates the practical responsible requirements of both textual and visual generative models, outlining five key considerations: generating truthful content, avoiding toxic content, refusing harmful instruction, leaking no training data-related content, and ensuring generated content identifiable. Specifically, we review recent advancements and challenges in addressing these requirements. Besides, we discuss and emphasize the importance of responsible GenAI across healthcare, education, finance, and artificial general intelligence domains. Through a unified perspective on both textual and visual generative models, this paper aims to provide insights into practical safety-related issues and further benefit the community in building responsible GenAI.
- [942] arXiv:2404.05809 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Self-Labeling in Multivariate Causality and Quantification for Adaptive Machine LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Methodology (stat.ME)
Abstract: Adaptive machine learning (ML) aims to allow ML models to adapt to ever-changing environments with potential concept drift after model deployment. Traditionally, adaptive ML requires a new dataset to be manually labeled to tailor deployed models to altered data distributions. Recently, an interactive causality based self-labeling method was proposed to autonomously associate causally related data streams for domain adaptation, showing promising results compared to traditional feature similarity-based semi-supervised learning. Several unanswered research questions remain, including self-labeling's compatibility with multivariate causality and the quantitative analysis of the auxiliary models used in the self-labeling. The auxiliary models, the interaction time model (ITM) and the effect state detector (ESD), are vital to the success of self-labeling. This paper further develops the self-labeling framework and its theoretical foundations to address these research questions. A framework for the application of self-labeling to multivariate causal graphs is proposed using four basic causal relationships, and the impact of non-ideal ITM and ESD performance is analyzed. A simulated experiment is conducted based on a multivariate causal graph, validating the proposed theory.
- [943] arXiv:2404.05825 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level EmbeddingSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Recently embedding-based retrieval or dense retrieval have shown state of the art results, compared with traditional sparse or bag-of-words based approaches. This paper introduces a model-agnostic doc-level embedding framework through large language model (LLM) augmentation. In addition, it also improves some important components in the retrieval model training process, such as negative sampling, loss function, etc. By implementing this LLM-augmented retrieval framework, we have been able to significantly improve the effectiveness of widely-used retriever models such as Bi-encoders (Contriever, DRAGON) and late-interaction models (ColBERTv2), thereby achieving state-of-the-art results on LoTTE datasets and BEIR datasets.
- [944] arXiv:2404.05829 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SambaLingo: Teaching Large Language Models New LanguagesZoltan Csaki , Bo Li , Jonathan Li , Qiantong Xu , Pian Pawakapan , Leon Zhang , Yun Du , Hengyu Zhao , Changran Hu , Urmish ThakkerComments: 23 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.
- [945] arXiv:2404.05840 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Attention-Driven Multi-Agent Reinforcement Learning: Enhancing Decisions with Expertise-Informed TasksSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: In this paper, we introduce an alternative approach to enhancing Multi-Agent Reinforcement Learning (MARL) through the integration of domain knowledge and attention-based policy mechanisms. Our methodology focuses on the incorporation of domain-specific expertise into the learning process, which simplifies the development of collaborative behaviors. This approach aims to reduce the complexity and learning overhead typically associated with MARL by enabling agents to concentrate on essential aspects of complex tasks, thus optimizing the learning curve. The utilization of attention mechanisms plays a key role in our model. It allows for the effective processing of dynamic context data and nuanced agent interactions, leading to more refined decision-making. Applied in standard MARL scenarios, such as the Stanford Intelligent Systems Laboratory (SISL) Pursuit and Multi-Particle Environments (MPE) Simple Spread, our method has been shown to improve both learning efficiency and the effectiveness of collaborative behaviors. The results indicate that our attention-based approach can be a viable approach for improving the efficiency of MARL training process, integrating domain-specific knowledge at the action level.
- [946] arXiv:2404.05868 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Negative Preference Optimization: From Catastrophic Collapse to Effective UnlearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Abstract: Large Language Models (LLMs) often memorize sensitive, private, or copyrighted data during pre-training. LLM unlearning aims to eliminate the influence of undesirable data from the pre-trained model while preserving the model's utilities on other tasks. Several practical methods have recently been proposed for LLM unlearning, mostly based on gradient ascent (GA) on the loss of undesirable data. However, on certain unlearning tasks, these methods either fail to effectively unlearn the target data or suffer from catastrophic collapse -- a drastic degradation of the model's utilities.
In this paper, we propose Negative Preference Optimization (NPO), a simple alignment-inspired method that could efficiently and effectively unlearn a target dataset. We theoretically show that the progression toward catastrophic collapse by minimizing the NPO loss is exponentially slower than GA. Through experiments on synthetic data and the benchmark TOFU dataset, we demonstrate that NPO-based methods achieve a better balance between unlearning the undesirable data and maintaining the model's utilities. We also observe that NPO-based methods generate more sensible outputs than GA-based methods, whose outputs are often gibberish. Remarkably, on TOFU, NPO-based methods are the first to achieve reasonable unlearning results in forgetting 50% (or more) of the training data, whereas existing methods already struggle with forgetting 10% of training data. - [947] arXiv:2404.05875 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CodecLM: Aligning Language Models with Tailored Synthetic DataZifeng Wang , Chun-Liang Li , Vincent Perot , Long T. Le , Jin Miao , Zizhao Zhang , Chen-Yu Lee , Tomas PfisterComments: Accepted to Findings of NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Instruction tuning has emerged as the key in aligning large language models (LLMs) with specific task instructions, thereby mitigating the discrepancy between the next-token prediction objective and users' actual goals. To reduce the labor and time cost to collect or annotate data by humans, researchers start to explore the use of LLMs to generate instruction-aligned synthetic data. Recent works focus on generating diverse instructions and applying LLM to increase instruction complexity, often neglecting downstream use cases. It remains unclear how to tailor high-quality data to elicit better instruction-following abilities in different target instruction distributions and LLMs. To this end, we introduce CodecLM, a general framework for adaptively generating high-quality synthetic data for LLM alignment with different downstream instruction distributions and LLMs. Drawing on the Encode-Decode principles, we use LLMs as codecs to guide the data generation process. We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution, and then decode metadata to create tailored instructions. We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples. Extensive experiments on four open-domain instruction following benchmarks validate the effectiveness of CodecLM over the current state-of-the-arts.
- [948] arXiv:2404.05891 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Condition Monitoring with Incomplete Data: An Integrated Variational Autoencoder and Distance Metric FrameworkSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Condition monitoring of industrial systems is crucial for ensuring safety and maintenance planning, yet notable challenges arise in real-world settings due to the limited or non-existent availability of fault samples. This paper introduces an innovative solution to this problem by proposing a new method for fault detection and condition monitoring for unseen data. Adopting an approach inspired by zero-shot learning, our method can identify faults and assign a relative health index to various operational conditions. Typically, we have plenty of data on normal operations, some data on compromised conditions, and very few (if any) samples of severe faults. We use a variational autoencoder to capture the probabilistic distribution of previously seen and new unseen conditions. The health status is determined by comparing each sample's deviation from a normal operation reference distribution in the latent space. Faults are detected by establishing a threshold for the health indexes, allowing the model to identify severe, unseen faults with high accuracy, even amidst noise. We validate our approach using the run-to-failure IMS-bearing dataset and compare it with other methods. The health indexes generated by our model closely match the established descriptive model of bearing wear, attesting to the robustness and reliability of our method. These findings highlight the potential of our methodology in augmenting fault detection capabilities within industrial domains, thereby contributing to heightened safety protocols and optimized maintenance practices.
- [949] arXiv:2404.05892 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Eagle and Finch: RWKV with Matrix-Valued States and Dynamic RecurrenceBo Peng , Daniel Goldstein , Quentin Anthony , Alon Albalak , Eric Alcaide , Stella Biderman , Eugene Cheah , Xingjian Du , Teddy Ferdinan , Haowen Hou , Przemysław Kazienko , Kranthi Kiran GV , Jan Kocoń , Bartłomiej Koptyra , Satyapriya Krishna , Ronald McClelland Jr. , Niklas Muennighoff , Fares Obeid , Atsushi Saito , Guangyu Song , Haoqin Tu , Stanisław Woźniak , Ruichong Zhang , Bingchen Zhao , Qihang Zhao , Peng Zhou , Jian Zhu , Rui-Jie ZhuSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture. Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs. We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokenizer based on greedy matching for enhanced multilinguality. We trained four Eagle models, ranging from 0.46 to 7.5 billion parameters, and two Finch models with 1.6 and 3.1 billion parameters and find that they achieve competitive performance across a wide variety of benchmarks. We release all our models on HuggingFace under the Apache 2.0 license. Models at: this https URL Training code at: this https URL Inference code at: this https URL Time-parallel training code at: this https URL
- [950] arXiv:2404.05894 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Learning Heuristics for Transit Network Design and Improvement with Deep Reinforcement LearningComments: In preparation for submission to the journal "Transportation Research Part C"Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: Transit agencies world-wide face tightening budgets. To maintain quality of service while cutting costs, efficient transit network design is essential. But planning a network of public transit routes is a challenging optimization problem. The most successful approaches to date use metaheuristic algorithms to search through the space of possible transit networks by applying low-level heuristics that randomly alter routes in a network. The design of these low-level heuristics has a major impact on the quality of the result. In this paper we use deep reinforcement learning with graph neural nets to learn low-level heuristics for an evolutionary algorithm, instead of designing them manually. These learned heuristics improve the algorithm's results on benchmark synthetic cities with 70 nodes or more, and obtain state-of-the-art results when optimizing operating costs. They also improve upon a simulation of the real transit network in the city of Laval, Canada, by as much as 54% and 18% on two key metrics, and offer cost savings of up to 12% over the city's existing transit network.
- [951] arXiv:2404.05902 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: WILBUR: Adaptive In-Context Learning for Robust and Accurate Web AgentsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In the realm of web agent research, achieving both generalization and accuracy remains a challenging problem. Due to high variance in website structure, existing approaches often fail. Moreover, existing fine-tuning and in-context learning techniques fail to generalize across multiple websites. We introduce Wilbur, an approach that uses a differentiable ranking model and a novel instruction synthesis technique to optimally populate a black-box large language model's prompt with task demonstrations from previous runs. To maximize end-to-end success rates, we also propose an intelligent backtracking mechanism that learns and recovers from its mistakes. Finally, we show that our ranking model can be trained on data from a generative auto-curriculum which samples representative goals from an LLM, runs the agent, and automatically evaluates it, with no manual annotation. Wilbur achieves state-of-the-art results on the WebVoyager benchmark, beating text-only models by 8% overall, and up to 36% on certain websites. On the same benchmark, Wilbur is within 5% of a strong multi-modal model despite only receiving textual inputs, and further analysis reveals a substantial number of failures are due to engineering challenges of operating the web.
- [952] arXiv:2404.05903 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Natural LearningComments: 10 pages, 5 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We introduce Natural Learning (NL), a novel algorithm that elevates the explainability and interpretability of machine learning to an extreme level. NL simplifies decisions into intuitive rules, like "We rejected your loan because your income, employment status, and age collectively resemble a rejected prototype more than an accepted prototype." When applied to real-life datasets, NL produces impressive results. For example, in a colon cancer dataset with 1545 patients and 10935 genes, NL achieves 98.1% accuracy, comparable to DNNs and RF, by analyzing just 3 genes of test samples against 2 discovered prototypes. Similarly, in the UCI's WDBC dataset, NL achieves 98.3% accuracy using only 7 features and 2 prototypes. Even on the MNIST dataset (0 vs. 1), NL achieves 99.5% accuracy with only 3 pixels from 2 prototype images. NL is inspired by prototype theory, an old concept in cognitive psychology suggesting that people learn single sparse prototypes to categorize objects. Leveraging this relaxed assumption, we redesign Support Vector Machines (SVM), replacing its mathematical formulation with a fully nearest-neighbor-based solution, and to address the curse of dimensionality, we utilize locality-sensitive hashing. Following theory's generalizability principle, we propose a recursive method to prune non-core features. As a result, NL efficiently discovers the sparsest prototypes in O(n^2pL) with high parallelization capacity in terms of n. Evaluation of NL with 17 benchmark datasets shows its significant outperformance compared to decision trees and logistic regression, two methods widely favored in healthcare for their interpretability. Moreover, NL achieves performance comparable to finetuned black-box models such as deep neural networks and random forests in 40% of cases, with only a 1-2% lower average accuracy. The code is available via this http URL .
- [953] arXiv:2404.05908 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Interpretability in Symbolic Regression: a benchmark of Explanatory Methods using the Feynman data setGuilherme Seidyo Imai Aldeia , Fabricio Olivetti de Franca (Federal University of ABC)Comments: 47 pages, 10 figures. This is a post peer-review, pre-copyedit version of an article published in Genetic Programming and Evolvable Machines Volume 23, pages 309-349, (2022). The final version is available on this https URLJournal-ref: Aldeia, G.S.I., de Franca, F.O. Interpretability in symbolic regression: a benchmark of explanatory methods using the Feynman data set. Genet Program Evolvable Mach 23, 309-349 (2022)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In some situations, the interpretability of the machine learning models plays a role as important as the model accuracy. Interpretability comes from the need to trust the prediction model, verify some of its properties, or even enforce them to improve fairness. Many model-agnostic explanatory methods exists to provide explanations for black-box models. In the regression task, the practitioner can use white-boxes or gray-boxes models to achieve more interpretable results, which is the case of symbolic regression. When using an explanatory method, and since interpretability lacks a rigorous definition, there is a need to evaluate and compare the quality and different explainers. This paper proposes a benchmark scheme to evaluate explanatory methods to explain regression models, mainly symbolic regression models. Experiments were performed using 100 physics equations with different interpretable and non-interpretable regression methods and popular explanation methods, evaluating the performance of the explainers performance with several explanation measures. In addition, we further analyzed four benchmarks from the GP community. The results have shown that Symbolic Regression models can be an interesting alternative to white-box and black-box models that is capable of returning accurate models with appropriate explanations. Regarding the explainers, we observed that Partial Effects and SHAP were the most robust explanation models, with Integrated Gradients being unstable only with tree-based models. This benchmark is publicly available for further experiments.
- [954] arXiv:2404.05913 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Deep Reinforcement Learning for Personalized Diagnostic Decision Pathways Using Electronic Health Records: A Comparative Study on Anemia and Systemic Lupus ErythematosusComments: arXiv admin note: substantial text overlap with arXiv:2305.06295Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Background: Clinical diagnosis is typically reached by following a series of steps recommended by guidelines authored by colleges of experts. Accordingly, guidelines play a crucial role in rationalizing clinical decisions but suffer from limitations as they are built to cover the majority of the population and fail at covering patients with uncommon conditions. Moreover, their updates are long and expensive, making them unsuitable for emerging diseases and practices.
Methods: Inspired by guidelines, we formulate the task of diagnosis as a sequential decision-making problem and study the use of Deep Reinforcement Learning (DRL) algorithms to learn the optimal sequence of actions to perform in order to obtain a correct diagnosis from Electronic Health Records (EHRs). We apply DRL on synthetic, but realistic EHRs and develop two clinical use cases: Anemia diagnosis, where the decision pathways follow the schema of a decision tree; and Systemic Lupus Erythematosus (SLE) diagnosis, which follows a weighted criteria score. We particularly evaluate the robustness of our approaches to noisy and missing data since these frequently occur in EHRs.
Results: In both use cases, and in the presence of imperfect data, our best DRL algorithms exhibit competitive performance when compared to the traditional classifiers, with the added advantage that they enable the progressive generation of a pathway to the suggested diagnosis which can both guide and explain the decision-making process.
Conclusion: DRL offers the opportunity to learn personalized decision pathways to diagnosis. We illustrate with our two use cases their advantages: they generate step-by-step pathways that are self-explanatory; and their correctness is competitive when compared to state-of-the-art approaches. - [955] arXiv:2404.05920 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Inclusive Practices for Child-Centered AI Design and TestingComments: CHI 2024 Workshop on Child-centred AI Design, May 11, 2024, Honolulu, HI, USASubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: We explore ideas and inclusive practices for designing and testing child-centered artificially intelligent technologies for neurodivergent children. AI is promising for supporting social communication, self-regulation, and sensory processing challenges common for neurodivergent children. The authors, both neurodivergent individuals and related to neurodivergent people, draw from their professional and personal experiences to offer insights on creating AI technologies that are accessible and include input from neurodivergent children. We offer ideas for designing AI technologies for neurodivergent children and considerations for including them in the design process while accounting for their sensory sensitivities. We conclude by emphasizing the importance of adaptable and supportive AI technologies and design processes and call for further conversation to refine child-centered AI design and testing methods.
- [956] arXiv:2404.05943 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Interplay of Machine Translation, Diacritics, and DiacritizationComments: Accepted to NAACL 2024 Main ConferenceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We investigate two research questions: (1) how do machine translation (MT) and diacritization influence the performance of each other in a multi-task learning setting (2) the effect of keeping (vs. removing) diacritics on MT performance. We examine these two questions in both high-resource (HR) and low-resource (LR) settings across 55 different languages (36 African languages and 19 European languages). For (1), results show that diacritization significantly benefits MT in the LR scenario, doubling or even tripling performance for some languages, but harms MT in the HR scenario. We find that MT harms diacritization in LR but benefits significantly in HR for some languages. For (2), MT performance is similar regardless of diacritics being kept or removed. In addition, we propose two classes of metrics to measure the complexity of a diacritical system, finding these metrics to correlate positively with the performance of our diacritization models. Overall, our work provides insights for developing MT and diacritization systems under different data size conditions and may have implications that generalize beyond the 55 languages we investigate.
- [957] arXiv:2404.05950 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Efficient Multi-Task Reinforcement Learning via Task-Specific Action CorrectionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Multi-task reinforcement learning (MTRL) demonstrate potential for enhancing the generalization of a robot, enabling it to perform multiple tasks concurrently. However, the performance of MTRL may still be susceptible to conflicts between tasks and negative interference. To facilitate efficient MTRL, we propose Task-Specific Action Correction (TSAC), a general and complementary approach designed for simultaneous learning of multiple tasks. TSAC decomposes policy learning into two separate policies: a shared policy (SP) and an action correction policy (ACP). To alleviate conflicts resulting from excessive focus on specific tasks' details in SP, ACP incorporates goal-oriented sparse rewards, enabling an agent to adopt a long-term perspective and achieve generalization across tasks. Additional rewards transform the original problem into a multi-objective MTRL problem. Furthermore, to convert the multi-objective MTRL into a single-objective formulation, TSAC assigns a virtual expected budget to the sparse rewards and employs Lagrangian method to transform a constrained single-objective optimization into an unconstrained one. Experimental evaluations conducted on Meta-World's MT10 and MT50 benchmarks demonstrate that TSAC outperforms existing state-of-the-art methods, achieving significant improvements in both sample efficiency and effective action execution.
- [958] arXiv:2404.05955 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Multimodal Large Language models (MLLMs) have shown promise in web-related tasks, but evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. Existing benchmarks are either designed for general multimodal tasks, failing to capture the unique characteristics of web pages, or focus on end-to-end web agent tasks, unable to measure fine-grained abilities such as OCR, understanding, and grounding. In this paper, we introduce \bench{}, a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks. \bench{} consists of seven tasks, and comprises 1.5K human-curated instances from 139 real websites, covering 87 sub-domains. We evaluate 14 open-source MLLMs, Gemini Pro, Claude-3 series, and GPT-4V(ision) on \bench{}, revealing significant challenges and performance gaps. Further analysis highlights the limitations of current MLLMs, including inadequate grounding in text-rich environments and subpar performance with low-resolution image inputs. We believe \bench{} will serve as a valuable resource for the research community and contribute to the creation of more powerful and versatile MLLMs for web-related applications.
- [959] arXiv:2404.05959 (cross-list from physics.optics) [ pdf , ps , other ]
-
Title: Map Optical Properties to Subwavelength Structures Directly via a Diffusion ModelShijie Rao , Kaiyu Cui , Yidong Huang , Jiawei Yang , Yali Li , Shengjin Wang , Xue Feng , Fang Liu , Wei ZhangSubjects: Optics (physics.optics) ; Artificial Intelligence (cs.AI)
Abstract: Subwavelength photonic structures and metamaterials provide revolutionary approaches for controlling light. The inverse design methods proposed for these subwavelength structures are vital to the development of new photonic devices. However, most of the existing inverse design methods cannot realize direct mapping from optical properties to photonic structures but instead rely on forward simulation methods to perform iterative optimization. In this work, we exploit the powerful generative abilities of artificial intelligence (AI) and propose a practical inverse design method based on latent diffusion models. Our method maps directly the optical properties to structures without the requirement of forward simulation and iterative optimization. Here, the given optical properties can work as "prompts" and guide the constructed model to correctly "draw" the required photonic structures. Experiments show that our direct mapping-based inverse design method can generate subwavelength photonic structures at high fidelity while following the given optical properties. This may change the method used for optical design and greatly accelerate the research on new photonic devices.
- [960] arXiv:2404.05961 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: LLM2Vec: Large Language Models Are Secretly Powerful Text EncodersParishad BehnamGhader , Vaibhav Adlakha , Marius Mosbach , Dzmitry Bahdanau , Nicolas Chapados , Siva ReddySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3 popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data. Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.
- [961] arXiv:2404.05966 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: THOUGHTSCULPT: Reasoning with Intermediate Revision and SearchComments: Code and data available at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We present THOUGHTSCULPT, a general reasoning and search method for tasks with outputs that can be decomposed into components. THOUGHTSCULPT explores a search tree of potential solutions using Monte Carlo Tree Search (MCTS), building solutions one action at a time and evaluating according to any domain-specific heuristic, which in practice is often simply an LLM evaluator. Critically, our action space includes revision actions: THOUGHTSCULPT may choose to revise part of its previous output rather than continuing to build the rest of its output. Empirically, THOUGHTSCULPT outperforms state-of-the-art reasoning methods across three challenging tasks: Story Outline Improvement (up to +30% interestingness), Mini-Crosswords Solving (up to +16% word success rate), and Constrained Generation (up to +10% concept coverage).
- [962] arXiv:2404.05967 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: JSTR: Judgment Improves Scene Text RecognitionComments: IntelliSys 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: In this paper, we present a method for enhancing the accuracy of scene text recognition tasks by judging whether the image and text match each other. While previous studies focused on generating the recognition results from input images, our approach also considers the model's misrecognition results to understand its error tendencies, thus improving the text recognition pipeline. This method boosts text recognition accuracy by providing explicit feedback on the data that the model is likely to misrecognize by predicting correct or incorrect between the image and text. The experimental results on publicly available datasets demonstrate that our proposed method outperforms the baseline and state-of-the-art methods in scene text recognition.
- [963] arXiv:2404.05971 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Does Transformer Interpretability Transfer to RNNs?Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures. In this paper, we examine if selected interpretability methods originally designed for transformer language models will transfer to these up-and-coming recurrent architectures. Specifically, we focus on steering model outputs via contrastive activation addition, on eliciting latent predictions via the tuned lens, and eliciting latent knowledge from models fine-tuned to produce false outputs under certain conditions. Our results show that most of these techniques are effective when applied to RNNs, and we show that it is possible to improve some of them by taking advantage of RNNs' compressed state.
- [964] arXiv:2404.05980 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Tackling Structural Hallucination in Image Translation with Local DiffusionSeunghoi Kim , Chen Jin , Tom Diethe , Matteo Figini , Henry F. J. Tregidgo , Asher Mullokandov , Philip Teare , Daniel C. AlexanderSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent developments in diffusion models have advanced conditioned image generation, yet they struggle with reconstructing out-of-distribution (OOD) images, such as unseen tumors in medical images, causing "image hallucination" and risking misdiagnosis. We hypothesize such hallucinations result from local OOD regions in the conditional images. We verify that partitioning the OOD region and conducting separate image generations alleviates hallucinations in several applications. From this, we propose a training-free diffusion framework that reduces hallucination with multiple Local Diffusion processes. Our approach involves OOD estimation followed by two modules: a "branching" module generates locally both within and outside OOD regions, and a "fusion" module integrates these predictions into one. Our evaluation shows our method mitigates hallucination over baseline models quantitatively and qualitatively, reducing misdiagnosis by 40% and 25% in the real-world medical and natural image datasets, respectively. It also demonstrates compatibility with various pre-trained diffusion models.
- [965] arXiv:2404.05990 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Automatic Authorities: Power and AISubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: As rapid advances in Artificial Intelligence and the rise of some of history's most potent corporations meet the diminished neoliberal state, people are increasingly subject to power exercised by means of automated systems. Machine learning and related computational technologies now underpin vital government services. They connect consumers and producers in new algorithmic markets. They determine how we find out about everything from how to vote to where to get vaccinated, and whose speech is amplified, reduced, or restricted. And a new wave of products based on Large Language Models (LLMs) will further transform our economic and political lives. Automatic Authorities are automated computational systems used to exercise power over us by determining what we may know, what we may have, and what our options will be. In response to their rise, scholars working on the societal impacts of AI and related technologies have advocated shifting attention from how to make AI systems beneficial or fair towards a critical analysis of these new power relations. But power is everywhere, and is not necessarily bad. On what basis should we object to new or intensified power relations, and what can be done to justify them? This paper introduces the philosophical materials with which to formulate these questions, and offers preliminary answers. It starts by pinning down the concept of power, focusing on the ability that some agents have to shape others' lives. It then explores how AI enables and intensifies the exercise of power so understood, and sketches three problems with power and three ways to solve those problems. It emphasises, in particular, that justifying power requires more than satisfying substantive justificatory criteria; standards of proper authority and procedural legitimacy must also be met. We need to know not only what power may be used for, but how it may be used, and by whom.
- [966] arXiv:2404.06003 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language ModelsZhuohao Yu , Chang Gao , Wenjin Yao , Yidong Wang , Zhengran Zeng , Wei Ye , Jindong Wang , Yue Zhang , Shikun ZhangComments: We open-source all our code at: this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The rapid development of large language model (LLM) evaluation methodologies and datasets has led to a profound challenge: integrating state-of-the-art evaluation techniques cost-effectively while ensuring reliability, reproducibility, and efficiency. Currently, there is a notable absence of a unified and adaptable framework that seamlessly integrates various evaluation approaches. Moreover, the reliability of evaluation findings is often questionable due to potential data contamination, with the evaluation efficiency commonly overlooked when facing the substantial costs associated with LLM inference. In response to these challenges, we introduce FreeEval, a modular and scalable framework crafted to enable trustworthy and efficient automatic evaluations of LLMs. Firstly, FreeEval's unified abstractions simplify the integration and improve the transparency of diverse evaluation methodologies, encompassing dynamic evaluation that demand sophisticated LLM interactions. Secondly, the framework integrates meta-evaluation techniques like human evaluation and data contamination detection, which, along with dynamic evaluation modules in the platform, enhance the fairness of the evaluation outcomes. Lastly, FreeEval is designed with a high-performance infrastructure, including distributed computation and caching strategies, enabling extensive evaluations across multi-node, multi-GPU clusters for open-source and proprietary LLMs.
- [967] arXiv:2404.06007 (cross-list from cs.IT) [ pdf , ps , html , other ]
-
Title: Collaborative Edge AI Inference over Cloud-RANComments: This paper is accepted by IEEE Transactions on Communications on 08-Apr-2024Subjects: Information Theory (cs.IT) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract: In this paper, a cloud radio access network (Cloud-RAN) based collaborative edge AI inference architecture is proposed. Specifically, geographically distributed devices capture real-time noise-corrupted sensory data samples and extract the noisy local feature vectors, which are then aggregated at each remote radio head (RRH) to suppress sensing noise. To realize efficient uplink feature aggregation, we allow each RRH receives local feature vectors from all devices over the same resource blocks simultaneously by leveraging an over-the-air computation (AirComp) technique. Thereafter, these aggregated feature vectors are quantized and transmitted to a central processor (CP) for further aggregation and downstream inference tasks. Our aim in this work is to maximize the inference accuracy via a surrogate accuracy metric called discriminant gain, which measures the discernibility of different classes in the feature space. The key challenges lie on simultaneously suppressing the coupled sensing noise, AirComp distortion caused by hostile wireless channels, and the quantization error resulting from the limited capacity of fronthaul links. To address these challenges, this work proposes a joint transmit precoding, receive beamforming, and quantization error control scheme to enhance the inference accuracy. Extensive numerical experiments demonstrate the effectiveness and superiority of our proposed optimization algorithm compared to various baselines.
- [968] arXiv:2404.06014 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Using 3-Objective Evolutionary Algorithms for the Dynamic Chance Constrained Knapsack ProblemSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Real-world optimization problems often involve stochastic and dynamic components. Evolutionary algorithms are particularly effective in these scenarios, as they can easily adapt to uncertain and changing environments but often uncertainty and dynamic changes are studied in isolation. In this paper, we explore the use of 3-objective evolutionary algorithms for the chance constrained knapsack problem with dynamic constraints. In our setting, the weights of the items are stochastic and the knapsack's capacity changes over time. We introduce a 3-objective formulation that is able to deal with the stochastic and dynamic components at the same time and is independent of the confidence level required for the constraint. This new approach is then compared to the 2-objective formulation which is limited to a single confidence level. We evaluate the approach using two different multi-objective evolutionary algorithms (MOEAs), namely the global simple evolutionary multi-objective optimizer (GSEMO) and the multi-objective evolutionary algorithm based on decomposition (MOEA/D), across various benchmark scenarios. Our analysis highlights the advantages of the 3-objective formulation over the 2-objective formulation in addressing the dynamic chance constrained knapsack problem.
- [969] arXiv:2404.06022 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Band-Attention Modulated RetNet for Face Forgery DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: The transformer networks are extensively utilized in face forgery detection due to their scalability across large datasets.Despite their success, transformers face challenges in balancing the capture of global context, which is crucial for unveiling forgery clues, with computational this http URL mitigate this issue, we introduce Band-Attention modulated RetNet (BAR-Net), a lightweight network designed to efficiently process extensive visual contexts while avoiding catastrophic forgetting.Our approach empowers the target token to perceive global information by assigning differential attention levels to tokens at varying distances. We implement self-attention along both spatial axes, thereby maintaining spatial priors and easing the computational burden.Moreover, we present the adaptive frequency Band-Attention Modulation mechanism, which treats the entire Discrete Cosine Transform spectrogram as a series of frequency bands with learnable weights.Together, BAR-Net achieves favorable performance on several face forgery datasets, outperforming current state-of-the-art methods.
- [970] arXiv:2404.06025 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Greedy-DiM: Greedy Algorithms for Unreasonably Effective Face MorphsComments: Initial preprint. Under reviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Morphing attacks are an emerging threat to state-of-the-art Face Recognition (FR) systems, which aim to create a single image that contains the biometric information of multiple identities. Diffusion Morphs (DiM) are a recently proposed morphing attack that has achieved state-of-the-art performance for representation-based morphing attacks. However, none of the existing research on DiMs have leveraged the iterative nature of DiMs and left the DiM model as a black box, treating it no differently than one would a Generative Adversarial Network (GAN) or Varational AutoEncoder (VAE). We propose a greedy strategy on the iterative sampling process of DiM models which searches for an optimal step guided by an identity-based heuristic function. We compare our proposed algorithm against ten other state-of-the-art morphing algorithms using the open-source SYN-MAD 2022 competition dataset. We find that our proposed algorithm is unreasonably effective, fooling all of the tested FR systems with an MMPMR of 100%, outperforming all other morphing algorithms compared.
- [971] arXiv:2404.06059 (cross-list from quant-ph) [ pdf , ps , other ]
-
Title: Efficient Quantum Circuits for Machine Learning Activation Functions including Constant T-depth ReLUComments: 13 pagesSubjects: Quantum Physics (quant-ph) ; Artificial Intelligence (cs.AI)
Abstract: In recent years, Quantum Machine Learning (QML) has increasingly captured the interest of researchers. Among the components in this domain, activation functions hold a fundamental and indispensable role. Our research focuses on the development of activation functions quantum circuits for integration into fault-tolerant quantum computing architectures, with an emphasis on minimizing $T$-depth. Specifically, we present novel implementations of ReLU and leaky ReLU activation functions, achieving constant $T$-depths of 4 and 8, respectively. Leveraging quantum lookup tables, we extend our exploration to other activation functions such as the sigmoid. This approach enables us to customize precision and $T$-depth by adjusting the number of qubits, making our results more adaptable to various application scenarios. This study represents a significant advancement towards enhancing the practicality and application of quantum machine learning.
- [972] arXiv:2404.06063 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: All in One: An Empirical Study of GPT for Few-Shot Aspect-Based Sentiment AnlaysisComments: 9 pages, 5 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Aspect-Based Sentiment Analysis (ABSA) is an indispensable and highly challenging task in natural language processing. Current efforts have focused on specific sub-tasks, making it difficult to comprehensively cover all sub-tasks within the ABSA domain. With the development of Generative Pre-trained Transformers (GPTs), there came inspiration for a one-stop solution to sentiment analysis. In this study, we used GPTs for all sub-tasks of few-shot ABSA while defining a general learning paradigm for this application. We propose the All in One (AiO) model, a simple yet effective two-stage model for all ABSA sub-tasks. In the first stage, a specific backbone network learns the semantic information of the review and generates heuristically enhanced candidates. In the second stage, AiO leverages GPT contextual learning capabilities to generate predictions. The study conducted comprehensive comparative and ablation experiments on five benchmark datasets, and the results show that AiO can effectively handle all ABSA sub-tasks, even with few-shot data.
- [973] arXiv:2404.06077 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Is Your AI Truly Yours? Leveraging Blockchain for Copyrights, Provenance, and LineageSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: As Artificial Intelligence (AI) integrates into diverse areas, particularly in content generation, ensuring rightful ownership and ethical use becomes paramount. AI service providers are expected to prioritize responsibly sourcing training data and obtaining licenses from data owners. However, existing studies primarily center on safeguarding static copyrights, which simply treats metadata/datasets as non-fungible items with transferable/trading capabilities, neglecting the dynamic nature of training procedures that can shape an ongoing trajectory.
In this paper, we present \textsc{IBis}, a blockchain-based framework tailored for AI model training workflows. \textsc{IBis} integrates on-chain registries for datasets, licenses and models, alongside off-chain signing services to facilitate collaboration among multiple participants. Our framework addresses concerns regarding data and model provenance and copyright compliance. \textsc{IBis} enables iterative model retraining and fine-tuning, and offers flexible license checks and renewals. Further, \textsc{IBis} provides APIs designed for seamless integration with existing contract management software, minimizing disruptions to established model training processes. We implement \textsc{IBis} using Daml on the Canton blockchain. Evaluation results showcase the feasibility and scalability of \textsc{IBis} across varying numbers of users, datasets, models, and licenses. - [974] arXiv:2404.06079 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit ChallengeYiwei Guo , Chenrun Wang , Yifan Yang , Hankun Wang , Ziyang Ma , Chenpeng Du , Shuai Wang , Hanzheng Li , Shuai Fan , Hui Zhang , Xie Chen , Kai YuComments: 5 pages, 3 figures. Report of a challengeSubjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI)
Abstract: Discrete speech tokens have been more and more popular in multiple speech processing fields, including automatic speech recognition (ASR), text-to-speech (TTS) and singing voice synthesis (SVS). In this paper, we describe the systems developed by the SJTU X-LANCE group for the TTS (acoustic + vocoder), SVS, and ASR tracks in the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge. Notably, we achieved 1st rank on the leaderboard in the TTS track both with the whole training set and only 1h training data, with the highest UTMOS score and lowest bitrate among all submissions.
- [975] arXiv:2404.06090 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Fair Graph Neural Network with Supervised Contrastive RegularizationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In recent years, Graph Neural Networks (GNNs) have made significant advancements, particularly in tasks such as node classification, link prediction, and graph representation. However, challenges arise from biases that can be hidden not only in the node attributes but also in the connections between entities. Therefore, ensuring fairness in graph neural network learning has become a critical problem. To address this issue, we propose a novel model for training fairness-aware GNN, which enhances the Counterfactual Augmented Fair Graph Neural Network Framework (CAF). Our approach integrates Supervised Contrastive Loss and Environmental Loss to enhance both accuracy and fairness. Experimental validation on three real datasets demonstrates the superiority of our proposed model over CAF and several other existing graph-based learning methods.
- [976] arXiv:2404.06114 (cross-list from cs.DC) [ pdf , ps , html , other ]
-
Title: Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive SurveySubjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI)
Abstract: With the rapid growth in the volume of data sets, models, and devices in the domain of deep learning, there is increasing attention on large-scale distributed deep learning. In contrast to traditional distributed deep learning, the large-scale scenario poses new challenges that include fault tolerance, scalability of algorithms and infrastructures, and heterogeneity in data sets, models, and resources. Due to intensive synchronization of models and sharing of data across GPUs and computing nodes during distributed training and inference processes, communication efficiency becomes the bottleneck for achieving high performance at a large scale. This article surveys the literature over the period of 2018-2023 on algorithms and technologies aimed at achieving efficient communication in large-scale distributed deep learning at various levels, including algorithms, frameworks, and infrastructures. Specifically, we first introduce efficient algorithms for model synchronization and communication data compression in the context of large-scale distributed training. Next, we introduce efficient strategies related to resource allocation and task scheduling for use in distributed training and inference. After that, we present the latest technologies pertaining to modern communication infrastructures used in distributed deep learning with a focus on examining the impact of the communication overhead in a large-scale and heterogeneous setting. Finally, we conduct a case study on the distributed training of large language models at a large scale to illustrate how to apply these technologies in real cases. This article aims to offer researchers a comprehensive understanding of the current landscape of large-scale distributed deep learning and to reveal promising future research directions toward communication-efficient solutions in this scope.
- [977] arXiv:2404.06124 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Hierarchical Insights: Exploiting Structural Similarities for Reliable 3D Semantic SegmentationComments: submitted to IROS 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Safety-critical applications like autonomous driving call for robust 3D environment perception algorithms which can withstand highly diverse and ambiguous surroundings. The predictive performance of any classification model strongly depends on the underlying dataset and the prior knowledge conveyed by the annotated labels. While the labels provide a basis for the learning process, they usually fail to represent inherent relations between the classes - representations, which are a natural element of the human perception system. We propose a training strategy which enables a 3D LiDAR semantic segmentation model to learn structural relationships between the different classes through abstraction. We achieve this by implicitly modeling those relationships through a learning rule for hierarchical multi-label classification (HMC). With a detailed analysis we show, how this training strategy not only improves the model's confidence calibration, but also preserves additional information for downstream tasks like fusion, prediction and planning.
- [978] arXiv:2404.06127 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: FLEX: FLEXible Federated Learning FrameworkFrancisco Herrera , Daniel Jiménez-López , Alberto Argente-Garrido , Nuria Rodríguez-Barroso , Cristina Zuheros , Ignacio Aguilera-Martos , Beatriz Bello , Mario García-Márquez , M. Victoria LuzónComments: Submitted to Information FusionSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: In the realm of Artificial Intelligence (AI), the need for privacy and security in data processing has become paramount. As AI applications continue to expand, the collection and handling of sensitive data raise concerns about individual privacy protection. Federated Learning (FL) emerges as a promising solution to address these challenges by enabling decentralized model training on local devices, thus preserving data privacy. This paper introduces FLEX: a FLEXible Federated Learning Framework designed to provide maximum flexibility in FL research experiments. By offering customizable features for data distribution, privacy parameters, and communication strategies, FLEX empowers researchers to innovate and develop novel FL techniques. The framework also includes libraries for specific FL implementations including: (1) anomalies, (2) blockchain, (3) adversarial attacks and defences, (4) natural language processing and (5) decision trees, enhancing its versatility and applicability in various domains. Overall, FLEX represents a significant advancement in FL research, facilitating the development of robust and efficient FL applications.
- [979] arXiv:2404.06137 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SmurfCat at SemEval-2024 Task 6: Leveraging Synthetic Data for Hallucination DetectionElisei Rykov , Yana Shishkina , Kseniia Petrushina , Kseniia Titova , Sergey Petrakov , Alexander PanchenkoComments: 12 pages, 10 tables, 3 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we present our novel systems developed for the SemEval-2024 hallucination detection task. Our investigation spans a range of strategies to compare model predictions with reference standards, encompassing diverse baselines, the refinement of pre-trained encoders through supervised learning, and an ensemble approaches utilizing several high-performing models. Through these explorations, we introduce three distinct methods that exhibit strong performance metrics. To amplify our training data, we generate additional training samples from unlabelled training subset. Furthermore, we provide a detailed comparative analysis of our approaches. Notably, our premier method achieved a commendable 9th place in the competition's model-agnostic track and 17th place in model-aware track, highlighting its effectiveness and potential.
- [980] arXiv:2404.06144 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Differential Privacy for Anomaly Detection: Analyzing the Trade-off Between Privacy and ExplainabilityFatima Ezzeddine , Mirna Saad , Omran Ayoub , Davide Andreoletti , Martin Gjoreski , Ihab Sbeity , Marc Langheinrich , Silvia GiordanoSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Anomaly detection (AD), also referred to as outlier detection, is a statistical process aimed at identifying observations within a dataset that significantly deviate from the expected pattern of the majority of the data. Such a process finds wide application in various fields, such as finance and healthcare. While the primary objective of AD is to yield high detection accuracy, the requirements of explainability and privacy are also paramount. The first ensures the transparency of the AD process, while the second guarantees that no sensitive information is leaked to untrusted parties. In this work, we exploit the trade-off of applying Explainable AI (XAI) through SHapley Additive exPlanations (SHAP) and differential privacy (DP). We perform AD with different models and on various datasets, and we thoroughly evaluate the cost of privacy in terms of decreased accuracy and explainability. Our results show that the enforcement of privacy through DP has a significant impact on detection accuracy and explainability, which depends on both the dataset and the considered AD model. We further show that the visual interpretation of explanations is also influenced by the choice of the AD algorithm.
- [981] arXiv:2404.06152 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: HFNeRF: Learning Human Biomechanic Features with Neural Radiance FieldsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In recent advancements in novel view synthesis, generalizable Neural Radiance Fields (NeRF) based methods applied to human subjects have shown remarkable results in generating novel views from few images. However, this generalization ability cannot capture the underlying structural features of the skeleton shared across all instances. Building upon this, we introduce HFNeRF: a novel generalizable human feature NeRF aimed at generating human biomechanic features using a pre-trained image encoder. While previous human NeRF methods have shown promising results in the generation of photorealistic virtual avatars, such methods lack underlying human structure or biomechanic features such as skeleton or joint information that are crucial for downstream applications including Augmented Reality (AR)/Virtual Reality (VR). HFNeRF leverages 2D pre-trained foundation models toward learning human features in 3D using neural rendering, and then volume rendering towards generating 2D feature maps. We evaluate HFNeRF in the skeleton estimation task by predicting heatmaps as features. The proposed method is fully differentiable, allowing to successfully learn color, geometry, and human skeleton in a simultaneous manner. This paper presents preliminary results of HFNeRF, illustrating its potential in generating realistic virtual avatars with biomechanic features using NeRF.
- [982] arXiv:2404.06162 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Characterizing Multimodal Long-form Summarization: A Case Study on Financial ReportsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: As large language models (LLMs) expand the power of natural language processing to handle long inputs, rigorous and systematic analyses are necessary to understand their abilities and behavior. A salient application is summarization, due to its ubiquity and controversy (e.g., researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports not only are long but also use numbers and tables extensively. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Command. We find that GPT-3.5 and Command fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the summary and identify a position bias in LLMs. This position bias disappears after shuffling the input for Claude, which suggests that Claude has the ability to recognize important information. We also conduct a comprehensive investigation on the use of numeric data in LLM-generated summaries and offer a taxonomy of numeric hallucination. We employ prompt engineering to improve GPT-4's use of numbers with limited success. Overall, our analyses highlight the strong capability of Claude 2 in handling long multimodal inputs compared to GPT-4.
- [983] arXiv:2404.06167 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: scCDCG: Efficient Deep Structural Clustering for single-cell RNA-seq via Deep Cut-informed Graph EmbeddingComments: Accepted as a long paper for the research track at DASFAA 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Abstract: Single-cell RNA sequencing (scRNA-seq) is essential for unraveling cellular heterogeneity and diversity, offering invaluable insights for bioinformatics advancements. Despite its potential, traditional clustering methods in scRNA-seq data analysis often neglect the structural information embedded in gene expression profiles, crucial for understanding cellular correlations and dependencies. Existing strategies, including graph neural networks, face challenges in handling the inefficiency due to scRNA-seq data's intrinsic high-dimension and high-sparsity. Addressing these limitations, we introduce scCDCG (single-cell RNA-seq Clustering via Deep Cut-informed Graph), a novel framework designed for efficient and accurate clustering of scRNA-seq data that simultaneously utilizes intercellular high-order structural information. scCDCG comprises three main components: (i) A graph embedding module utilizing deep cut-informed techniques, which effectively captures intercellular high-order structural information, overcoming the over-smoothing and inefficiency issues prevalent in prior graph neural network methods. (ii) A self-supervised learning module guided by optimal transport, tailored to accommodate the unique complexities of scRNA-seq data, specifically its high-dimension and high-sparsity. (iii) An autoencoder-based feature learning module that simplifies model complexity through effective dimension reduction and feature extraction. Our extensive experiments on 6 datasets demonstrate scCDCG's superior performance and efficiency compared to 7 established models, underscoring scCDCG's potential as a transformative tool in scRNA-seq data analysis. Our code is available at: this https URL .
- [984] arXiv:2404.06170 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: CLIP-Embed-KD: Computationally Efficient Knowledge Distillation Using Embeddings as TeachersComments: Short paper - 5 pages; 5 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Contrastive Language-Image Pre-training (CLIP) has been shown to improve zero-shot generalization capabilities of language and vision models. In this paper, we extend CLIP for efficient knowledge distillation, by utilizing embeddings as teachers. Typical knowledge distillation frameworks require running forward passes through a teacher model, which is often prohibitive in the case of billion or trillion parameter teachers. In these cases, using only the embeddings of the teacher models to guide the distillation can yield significant computational savings. Our preliminary findings show that CLIP-based knowledge distillation with embeddings can outperform full scale knowledge distillation using $9\times$ less memory and $8\times$ less training time. Code available at: this https URL
- [985] arXiv:2404.06177 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Uncertainty-aware Evidential Fusion-based Learning for Semi-supervised Medical Image SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Although the existing uncertainty-based semi-supervised medical segmentation methods have achieved excellent performance, they usually only consider a single uncertainty evaluation, which often fails to solve the problem related to credibility completely. Therefore, based on the framework of evidential deep learning, this paper integrates the evidential predictive results in the cross-region of mixed and original samples to reallocate the confidence degree and uncertainty measure of each voxel, which is realized by emphasizing uncertain information of probability assignments fusion rule of traditional evidence theory. Furthermore, we design a voxel-level asymptotic learning strategy by introducing information entropy to combine with the fused uncertainty measure to estimate voxel prediction more precisely. The model will gradually pay attention to the prediction results with high uncertainty in the learning process, to learn the features that are difficult to master. The experimental results on LA, Pancreas-CT, ACDC and TBAD datasets demonstrate the superior performance of our proposed method in comparison with the existing state of the arts.
- [986] arXiv:2404.06181 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: EPL: Evidential Prototype Learning for Semi-supervised Medical Image SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Although current semi-supervised medical segmentation methods can achieve decent performance, they are still affected by the uncertainty in unlabeled data and model predictions, and there is currently a lack of effective strategies that can explore the uncertain aspects of both simultaneously. To address the aforementioned issues, we propose Evidential Prototype Learning (EPL), which utilizes an extended probabilistic framework to effectively fuse voxel probability predictions from different sources and achieves prototype fusion utilization of labeled and unlabeled data under a generalized evidential framework, leveraging voxel-level dual uncertainty masking. The uncertainty not only enables the model to self-correct predictions but also improves the guided learning process with pseudo-labels and is able to feed back into the construction of hidden features. The method proposed in this paper has been experimented on LA, Pancreas-CT and TBAD datasets, achieving the state-of-the-art performance in three different labeled ratios, which strongly demonstrates the effectiveness of our strategy.
- [987] arXiv:2404.06186 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Clue-Instruct: Text-Based Clue Generation for Educational Crossword PuzzlesAndrea Zugarini , Kamyar Zeinalipour , Surya Sai Kadali , Marco Maggini , Marco Gori , Leonardo RigutiniSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Crossword puzzles are popular linguistic games often used as tools to engage students in learning. Educational crosswords are characterized by less cryptic and more factual clues that distinguish them from traditional crossword puzzles. Despite there exist several publicly available clue-answer pair databases for traditional crosswords, educational clue-answer pairs datasets are missing. In this article, we propose a methodology to build educational clue generation datasets that can be used to instruct Large Language Models (LLMs). By gathering from Wikipedia pages informative content associated with relevant keywords, we use Large Language Models to automatically generate pedagogical clues related to the given input keyword and its context. With such an approach, we created clue-instruct, a dataset containing 44,075 unique examples with text-keyword pairs associated with three distinct crossword clues. We used clue-instruct to instruct different LLMs to generate educational clues from a given input content and keyword. Both human and automatic evaluations confirmed the quality of the generated clues, thus validating the effectiveness of our approach.
- [988] arXiv:2404.06188 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Diverse Randomized Value Functions: A Provably Pessimistic Approach for Offline Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Offline Reinforcement Learning (RL) faces distributional shift and unreliable value estimation, especially for out-of-distribution (OOD) actions. To address this, existing uncertainty-based methods penalize the value function with uncertainty quantification and demand numerous ensemble networks, posing computational challenges and suboptimal outcomes. In this paper, we introduce a novel strategy employing diverse randomized value functions to estimate the posterior distribution of $Q$-values. It provides robust uncertainty quantification and estimates lower confidence bounds (LCB) of $Q$-values. By applying moderate value penalties for OOD actions, our method fosters a provably pessimistic approach. We also emphasize on diversity within randomized value functions and enhance efficiency by introducing a diversity regularization method, reducing the requisite number of networks. These modules lead to reliable value estimation and efficient policy learning from offline data. Theoretical analysis shows that our method recovers the provably efficient LCB-penalty under linear MDP assumptions. Extensive empirical results also demonstrate that our proposed method significantly outperforms baseline methods in terms of performance and parametric efficiency.
- [989] arXiv:2404.06201 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Open-Source AI-based SE Tools: Opportunities and Challenges of Collaborative Software LearningZhihao Lin , Wei Ma , Tao Lin , Yaowen Zheng , Jingquan Ge , Jun Wang , Jacques Klein , Tegawende Bissyande , Yang Liu , Li LiSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have become instrumental in advancing software engineering (SE) tasks, showcasing their efficacy in code understanding and beyond. Like traditional SE tools, open-source collaboration is key in realising the excellent products. However, with AI models, the essential need is in data. The collaboration of these AI-based SE models hinges on maximising the sources of high-quality data. However, data especially of high quality, often holds commercial or sensitive value, making it less accessible for open-source AI-based SE projects. This reality presents a significant barrier to the development and enhancement of AI-based SE tools within the software engineering community. Therefore, researchers need to find solutions for enabling open-source AI-based SE models to tap into resources by different organisations. Addressing this challenge, our position paper investigates one solution to facilitate access to diverse organizational resources for open-source AI models, ensuring privacy and commercial sensitivities are respected. We introduce a governance framework centered on federated learning (FL), designed to foster the joint development and maintenance of open-source AI code models while safeguarding data privacy and security. Additionally, we present guidelines for developers on AI-based SE tool collaboration, covering data requirements, model architecture, updating strategies, and version control. Given the significant influence of data characteristics on FL, our research examines the effect of code data heterogeneity on FL performance.
- [990] arXiv:2404.06209 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of different techniques to assess whether a language model has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the in-context statistical learning abilities of LLMs. Without fine-tuning, we find them to be limited. This suggests that much of the few-shot performance on novel datasets is due to the LLM's world knowledge. Overall, our results highlight the importance of testing whether an LLM has seen an evaluation dataset during pre-training. We make the exposure tests we developed available as the tabmemcheck Python package at this https URL
- [991] arXiv:2404.06212 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: OmniFusion Technical ReportElizaveta Goncharova , Anton Razzhigaev , Matvey Mikhalchuk , Maxim Kurkin , Irina Abdullaeva , Matvey Skripkin , Ivan Oseledets , Denis Dimitrov , Andrey KuznetsovComments: 17 pages, 4 figures, 9 tables, 2 appendicesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Last year, multimodal architectures served up a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLM). We propose an \textit{OmniFusion} model based on a pretrained LLM and adapters for visual modality. We evaluated and compared several architecture design principles for better text and visual data coupling: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternVIT, etc.), and their fusing approach, image encoding method (whole image or tiles encoding) and two 7B LLMs (the proprietary one and open-source Mistral). Experiments on 8 visual-language benchmarks show the top score for the best OmniFusion setup in terms of different VQA tasks in comparison with open-source LLaVA-like solutions: VizWiz, Pope, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU. We also propose a variety of situations, where OmniFusion provides highly-detailed answers in different domains: housekeeping, sightseeing, culture, medicine, handwritten and scanned equations recognition, etc. Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at this https URL .
- [992] arXiv:2404.06224 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Low-Cost Generation and Evaluation of Dictionary Example SentencesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Dictionary example sentences play an important role in illustrating word definitions and usage, but manually creating quality sentences is challenging. Prior works have demonstrated that language models can be trained to generate example sentences. However, they relied on costly customized models and word sense datasets for generation and evaluation of their work. Rapid advancements in foundational models present the opportunity to create low-cost, zero-shot methods for the generation and evaluation of dictionary example sentences. We introduce a new automatic evaluation metric called OxfordEval that measures the win-rate of generated sentences against existing Oxford Dictionary sentences. OxfordEval shows high alignment with human judgments, enabling large-scale automated quality evaluation. We experiment with various LLMs and configurations to generate dictionary sentences across word classes. We complement this with a novel approach of using masked language models to identify and select sentences that best exemplify word meaning. The eventual model, FM-MLM, achieves over 85.1% win rate against Oxford baseline sentences according to OxfordEval, compared to 39.8% win rate for prior model-generated sentences.
- [993] arXiv:2404.06229 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Towards Autonomous Driving with Small-Scale Cars: A Survey of Recent DevelopmentSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: While engaging with the unfolding revolution in autonomous driving, a challenge presents itself, how can we effectively raise awareness within society about this transformative trend? While full-scale autonomous driving vehicles often come with a hefty price tag, the emergence of small-scale car platforms offers a compelling alternative. These platforms not only serve as valuable educational tools for the broader public and young generations but also function as robust research platforms, contributing significantly to the ongoing advancements in autonomous driving technology. This survey outlines various small-scale car platforms, categorizing them and detailing the research advancements accomplished through their usage. The conclusion provides proposals for promising future directions in the field.
- [994] arXiv:2404.06243 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in VideosSharana Dharshikgan Suresh Dass , Hrishav Bakul Barua , Ganesh Krishnasamy , Raveendran Paramesran , Raphael C.-W. PhanComments: Submitted for peer reviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
Abstract: Human action or activity recognition in videos is a fundamental task in computer vision with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction and many more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire. This work proposes a novel approach using Cross-Architecture Pseudo-Labeling with contrastive learning for semi-supervised action recognition. Our framework leverages both labeled and unlabelled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning for effective learning from both types of samples. We introduce a novel cross-architecture approach where 3D Convolutional Neural Networks (3D CNNs) and video transformers (VIT) are utilised to capture different aspects of action representations; hence we call it ActNetFormer. The 3D CNNs excel at capturing spatial features and local dependencies in the temporal domain, while VIT excels at capturing long-range dependencies across frames. By integrating these complementary architectures within the ActNetFormer framework, our approach can effectively capture both local and global contextual information of an action. This comprehensive representation learning enables the model to achieve better performance in semi-supervised action recognition tasks by leveraging the strengths of each of these architectures. Experimental results on standard action recognition datasets demonstrate that our approach performs better than the existing methods, achieving state-of-the-art performance with only a fraction of labeled data. The official website of this work is available at: this https URL .
- [995] arXiv:2404.06246 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: GHNeRF: Learning Generalizable Human Features with Efficient Neural Radiance FieldsArnab Dey , Di Yang , Rohith Agaram , Antitza Dantcheva , Andrew I. Comport , Srinath Sridhar , Jean MartinetSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recent advances in Neural Radiance Fields (NeRF) have demonstrated promising results in 3D scene representations, including 3D human representations. However, these representations often lack crucial information on the underlying human pose and structure, which is crucial for AR/VR applications and games. In this paper, we introduce a novel approach, termed GHNeRF, designed to address these limitations by learning 2D/3D joint locations of human subjects with NeRF representation. GHNeRF uses a pre-trained 2D encoder streamlined to extract essential human features from 2D images, which are then incorporated into the NeRF framework in order to encode human biomechanic features. This allows our network to simultaneously learn biomechanic features, such as joint locations, along with human geometry and texture. To assess the effectiveness of our method, we conduct a comprehensive comparison with state-of-the-art human NeRF techniques and joint estimation algorithms. Our results show that GHNeRF can achieve state-of-the-art results in near real-time.
- [996] arXiv:2404.06261 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Playing to Vision Foundation Model's Strengths in Stereo MatchingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Stereo matching has become a key technique for 3D environment perception in intelligent vehicles. For a considerable time, convolutional neural networks (CNNs) have remained the mainstream choice for feature extraction in this domain. Nonetheless, there is a growing consensus that the existing paradigm should evolve towards vision foundation models (VFM), particularly those developed based on vision Transformers (ViTs) and pre-trained through self-supervision on extensive, unlabeled datasets. While VFMs are adept at extracting informative, general-purpose visual features, specifically for dense prediction tasks, their performance often lacks in geometric vision tasks. This study serves as the first exploration of a viable approach for adapting VFMs to stereo matching. Our ViT adapter, referred to as ViTAS, is constructed upon three types of modules: spatial differentiation, patch attention fusion, and cross-attention. The first module initializes feature pyramids, while the latter two aggregate stereo and multi-scale contextual information into fine-grained features, respectively. ViTAStereo, which combines ViTAS with cost volume-based stereo matching back-end processes, achieves the top rank on the KITTI Stereo 2012 dataset and outperforms the second-best network StereoBase by approximately 7.9% in terms of the percentage of error pixels, with a tolerance of 3 pixels. Additional experiments across diverse scenarios further demonstrate its superior generalizability compared to all other state-of-the-art approaches. We believe this new paradigm will pave the way for the next generation of stereo matching networks.
- [997] arXiv:2404.06267 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: PGTNet: A Process Graph Transformer Network for Remaining Time Prediction of Business Process InstancesComments: 16 pages, 4 figures, To be published in: Advanced Information Systems Engineering - 36th International Conference, CAiSE 2024, Limassol, Cyprus, June 03-07, 2024, ProceedingsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We present PGTNet, an approach that transforms event logs into graph datasets and leverages graph-oriented data for training Process Graph Transformer Networks to predict the remaining time of business process instances. PGTNet consistently outperforms state-of-the-art deep learning approaches across a diverse range of 20 publicly available real-world event logs. Notably, our approach is most promising for highly complex processes, where existing deep learning approaches encounter difficulties stemming from their limited ability to learn control-flow relationships among process activities and capture long-range dependencies. PGTNet addresses these challenges, while also being able to consider multiple process perspectives during the learning process.
- [998] arXiv:2404.06278 (cross-list from cs.DB) [ pdf , ps , other ]
-
Title: Dimensionality Reduction in Sentence Transformer Vector Databases with Fast Fourier TransformComments: 13 pages, 5 figuresSubjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Dimensionality reduction in vector databases is pivotal for streamlining AI data management, enabling efficient storage, faster computation, and improved model performance. This paper explores the benefits of reducing vector database dimensions, with a focus on computational efficiency and overcoming the curse of dimensionality. We introduce a novel application of Fast Fourier Transform (FFT) to dimensionality reduction, a method previously underexploited in this context. By demonstrating its utility across various AI domains, including Retrieval-Augmented Generation (RAG) models and image processing, this FFT-based approach promises to improve data retrieval processes and enhance the efficiency and scalability of AI solutions. The incorporation of FFT may not only optimize operations in real-time processing and recommendation systems but also extend to advanced image processing techniques, where dimensionality reduction can significantly improve performance and analysis efficiency. This paper advocates for the broader adoption of FFT in vector database management, marking a significant stride towards addressing the challenges of data volume and complexity in AI research and applications. Unlike many existing approaches, we directly handle the embedding vectors produced by the model after processing a test input.
- [999] arXiv:2404.06279 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: NoiseNCA: Noisy Seed Improves Spatio-Temporal Continuity of Neural Cellular AutomataComments: 9 pages, 12 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Multiagent Systems (cs.MA)
Abstract: Neural Cellular Automata (NCA) is a class of Cellular Automata where the update rule is parameterized by a neural network that can be trained using gradient descent. In this paper, we focus on NCA models used for texture synthesis, where the update rule is inspired by partial differential equations (PDEs) describing reaction-diffusion systems. To train the NCA model, the spatio-termporal domain is discretized, and Euler integration is used to numerically simulate the PDE. However, whether a trained NCA truly learns the continuous dynamic described by the corresponding PDE or merely overfits the discretization used in training remains an open question. We study NCA models at the limit where space-time discretization approaches continuity. We find that existing NCA models tend to overfit the training discretization, especially in the proximity of the initial condition, also called "seed". To address this, we propose a solution that utilizes uniform noise as the initial condition. We demonstrate the effectiveness of our approach in preserving the consistency of NCA dynamics across a wide range of spatio-temporal granularities. Our improved NCA model enables two new test-time interactions by allowing continuous control over the speed of pattern formation and the scale of the synthesized patterns. We demonstrate this new NCA feature in our interactive online demo. Our work reveals that NCA models can learn continuous dynamics and opens new venues for NCA research from a dynamical systems' perspective.
- [1000] arXiv:2404.06324 (cross-list from cs.NI) [ pdf , ps , other ]
-
Title: Dynamic D2D-Assisted Federated Learning over O-RAN: Performance Analysis, MAC Scheduler, and Asymmetric User SelectionComments: 120 pages, 13 figuresSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Existing studies on federated learning (FL) are mostly focused on system orchestration for static snapshots of the network and making static control decisions (e.g., spectrum allocation). However, real-world wireless networks are susceptible to temporal variations of wireless channel capacity and users' datasets. In this paper, we incorporate multi-granular system dynamics (MSDs) into FL, including (M1) dynamic wireless channel capacity, captured by a set of discrete-time events, called $\mathscr{D}$-Events, and (M2) dynamic datasets of users. The latter is characterized by (M2-a) modeling the dynamics of user's dataset size via an ordinary differential equation and (M2-b) introducing dynamic model drift}, formulated via a partial differential inequality} drawing concrete analytical connections between the dynamics of users' datasets and FL accuracy. We then conduct FL orchestration under MSDs by introducing dynamic cooperative FL with dedicated MAC schedulers (DCLM), exploiting the unique features of open radio access network (O-RAN). DCLM proposes (i) a hierarchical device-to-device (D2D)-assisted model training, (ii) dynamic control decisions through dedicated O-RAN MAC schedulers, and (iii) asymmetric user selection. We provide extensive theoretical analysis to study the convergence of DCLM. We then optimize the degrees of freedom (e.g., user selection and spectrum allocation) in DCLM through a highly non-convex optimization problem. We develop a systematic approach to obtain the solution for this problem, opening the door to solving a broad variety of network-aware FL optimization problems. We show the efficiency of DCLM via numerical simulations and provide a series of future directions.
- [1001] arXiv:2404.06326 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: What is the $\textit{intrinsic}$ dimension of your binary data? -- and how to compute it quicklySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Dimensionality is an important aspect for analyzing and understanding (high-dimensional) data. In their 2006 ICDM paper Tatti et al. answered the question for a (interpretable) dimension of binary data tables by introducing a normalized correlation dimension. In the present work we revisit their results and contrast them with a concept based notion of intrinsic dimension (ID) recently introduced for geometric data sets. To do this, we present a novel approximation for this ID that is based on computing concepts only up to a certain support value. We demonstrate and evaluate our approximation using all available datasets from Tatti et al., which have between 469 and 41271 extrinsic dimensions.
- [1002] arXiv:2404.06330 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Generative Pre-Trained Transformer for Symbolic Regression Base In-Context Reinforcement LearningYanjie Li , Weijun Li , Lina Yu , Min Wu , Jingyi Liu , Wenqiang Li , Meilan Hao , Shu Wei , Yusong DengComments: 21 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The mathematical formula is the human language to describe nature and is the essence of scientific research. Finding mathematical formulas from observational data is a major demand of scientific research and a major challenge of artificial intelligence. This area is called symbolic regression. Originally symbolic regression was often formulated as a combinatorial optimization problem and solved using GP or reinforcement learning algorithms. These two kinds of algorithms have strong noise robustness ability and good Versatility. However, inference time usually takes a long time, so the search efficiency is relatively low. Later, based on large-scale pre-training data proposed, such methods use a large number of synthetic data points and expression pairs to train a Generative Pre-Trained Transformer(GPT). Then this GPT can only need to perform one forward propagation to obtain the results, the advantage is that the inference speed is very fast. However, its performance is very dependent on the training data and performs poorly on data outside the training set, which leads to poor noise robustness and Versatility of such methods. So, can we combine the advantages of the above two categories of SR algorithms? In this paper, we propose \textbf{FormulaGPT}, which trains a GPT using massive sparse reward learning histories of reinforcement learning-based SR algorithms as training data. After training, the SR algorithm based on reinforcement learning is distilled into a Transformer. When new test data comes, FormulaGPT can directly generate a "reinforcement learning process" and automatically update the learning policy in context. Tested on more than ten datasets including SRBench, formulaGPT achieves the state-of-the-art performance in fitting ability compared with four baselines. In addition, it achieves satisfactory results in noise robustness, versatility, and inference efficiency.
- [1003] arXiv:2404.06353 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: High Noise Scheduling is a MustSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Consistency models possess high capabilities for image generation, advancing sampling steps to a single step through their advanced techniques. Current advancements move one step forward consistency training techniques and eliminates the limitation of distillation training. Even though the proposed curriculum and noise scheduling in improved training techniques yield better results than basic consistency models, it lacks well balanced noise distribution and its consistency between curriculum. In this study, it is investigated the balance between high and low noise levels in noise distribution and offered polynomial noise distribution to maintain the stability. This proposed polynomial noise distribution is also supported with a predefined Karras noises to prevent unique noise levels arises with Karras noise generation algorithm. Furthermore, by elimination of learned noisy steps with a curriculum based on sinusoidal function increase the performance of the model in denoising. To make a fair comparison with the latest released consistency model training techniques, experiments are conducted with same hyper-parameters except curriculum and noise distribution. The models utilized during experiments are determined with low depth to prove the robustness of our proposed technique. The results show that the polynomial noise distribution outperforms the model trained with log-normal noise distribution, yielding a 33.54 FID score after 100,000 training steps with constant discretization steps. Additionally, the implementation of a sinusoidal-based curriculum enhances denoising performance, resulting in a FID score of 30.48.
- [1004] arXiv:2404.06356 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Policy-Guided DiffusionMatthew Thomas Jackson , Michael Tryfan Matthews , Cong Lu , Benjamin Ellis , Shimon Whiteson , Jakob FoersterComments: Previously at the NeurIPS 2023 Workshop on Robot LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: In many real-world settings, agents must learn from an offline dataset gathered by some prior behavior policy. Such a setting naturally leads to distribution shift between the behavior policy and the target policy being trained - requiring policy conservatism to avoid instability and overestimation bias. Autoregressive world models offer a different solution to this by generating synthetic, on-policy experience. However, in practice, model rollouts must be severely truncated to avoid compounding error. As an alternative, we propose policy-guided diffusion. Our method uses diffusion models to generate entire trajectories under the behavior distribution, applying guidance from the target policy to move synthetic experience further on-policy. We show that policy-guided diffusion models a regularized form of the target distribution that balances action likelihood under both the target and behavior policies, leading to plausible trajectories with high target policy probability, while retaining a lower dynamics error than an offline world model baseline. Using synthetic experience from policy-guided diffusion as a drop-in substitute for real data, we demonstrate significant improvements in performance across a range of standard offline reinforcement learning algorithms and environments. Our approach provides an effective alternative to autoregressive offline world models, opening the door to the controllable generation of synthetic training data.
- [1005] arXiv:2404.06362 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Test-Time Adaptation with SaLIP: A Cascade of SAM and CLIP for Zero shot Medical Image SegmentationSidra Aleem , Fangyijie Wang , Mayug Maniparambil , Eric Arazo , Julia Dietlmeier , Guenole Silvestre , Kathleen Curran , Noel E. O'Connor , Suzanne LittleSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The Segment Anything Model (SAM) and CLIP are remarkable vision foundation models (VFMs). SAM, a prompt driven segmentation model, excels in segmentation tasks across diverse domains, while CLIP is renowned for its zero shot recognition capabilities. However, their unified potential has not yet been explored in medical image segmentation. To adapt SAM to medical imaging, existing methods primarily rely on tuning strategies that require extensive data or prior prompts tailored to the specific task, making it particularly challenging when only a limited number of data samples are available. This work presents an in depth exploration of integrating SAM and CLIP into a unified framework for medical image segmentation. Specifically, we propose a simple unified framework, SaLIP, for organ segmentation. Initially, SAM is used for part based segmentation within the image, followed by CLIP to retrieve the mask corresponding to the region of interest (ROI) from the pool of SAM generated masks. Finally, SAM is prompted by the retrieved ROI to segment a specific organ. Thus, SaLIP is training and fine tuning free and does not rely on domain expertise or labeled data for prompt engineering. Our method shows substantial enhancements in zero shot segmentation, showcasing notable improvements in DICE scores across diverse segmentation tasks like brain (63.46%), lung (50.11%), and fetal head (30.82%), when compared to un prompted SAM. Code and text prompts are available at: this https URL .
- [1006] arXiv:2404.06369 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: VISION2UI: A Real-World Dataset with Layout for Code Generation from UI DesignsYi Gui , Zhen Li , Yao Wan , Yemin Shi , Hongyu Zhang , Yi Su , Shaoling Dong , Xing Zhou , Wenbin JiangSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Abstract: Automatically generating UI code from webpage design visions can significantly alleviate the burden of developers, enabling beginner developers or designers to directly generate Web pages from design diagrams. Currently, prior research has accomplished the objective of generating UI code from rudimentary design visions or sketches through designing deep neural networks. Inspired by the groundbreaking advancements achieved by Multimodal Large Language Models (MLLMs), the automatic generation of UI code from high-fidelity design images is now emerging as a viable possibility. Nevertheless, our investigation reveals that existing MLLMs are hampered by the scarcity of authentic, high-quality, and large-scale datasets, leading to unsatisfactory performance in automated UI code generation. To mitigate this gap, we present a novel dataset, termed VISION2UI, extracted from real-world scenarios, augmented with comprehensive layout information, tailored specifically for finetuning MLLMs in UI code generation. Specifically, this dataset is derived through a series of operations, encompassing collecting, cleaning, and filtering of the open-source Common Crawl dataset. In order to uphold its quality, a neural scorer trained on labeled samples is utilized to refine the data, retaining higher-quality instances. Ultimately, this process yields a dataset comprising 2,000 (Much more is coming soon) parallel samples encompassing design visions and UI code. The dataset is available at this https URL .
- [1007] arXiv:2404.06392 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Event Extraction in Basque: Typologically motivated Cross-Lingual Transfer-Learning AnalysisComments: Accepted at LREC-Coling 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Cross-lingual transfer-learning is widely used in Event Extraction for low-resource languages and involves a Multilingual Language Model that is trained in a source language and applied to the target language. This paper studies whether the typological similarity between source and target languages impacts the performance of cross-lingual transfer, an under-explored topic. We first focus on Basque as the target language, which is an ideal target language because it is typologically different from surrounding languages. Our experiments on three Event Extraction tasks show that the shared linguistic characteristic between source and target languages does have an impact on transfer quality. Further analysis of 72 language pairs reveals that for tasks that involve token classification such as entity and event trigger identification, common writing script and morphological features produce higher quality cross-lingual transfer. In contrast, for tasks involving structural prediction like argument extraction, common word order is the most relevant feature. In addition, we show that when increasing the training size, not all the languages scale in the same way in the cross-lingual setting. To perform the experiments we introduce EusIE, an event extraction dataset for Basque, which follows the Multilingual Event Extraction dataset (MEE). The dataset and code are publicly available.
- [1008] arXiv:2404.06393 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: MuPT: A Generative Symbolic Music Pretrained TransformerXingwei Qu , Yuelin Bai , Yinghao Ma , Ziya Zhou , Ka Man Lo , Jiaheng Liu , Ruibin Yuan , Lejun Min , Xueling Liu , Tianyu Zhang , Xinrun Du , Shuyue Guo , Yiming Liang , Yizhi Li , Shangda Wu , Junting Zhou , Tianyu Zheng , Ziyang Ma , Fengze Han , Wei Xue , Gus Xia , Emmanouil Benetos , Xiang Yue , Chenghua Lin , Xu Tan , Stephen W. Huang , Wenhu Chen , Jie Fu , Ge ZhangSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.
- [1009] arXiv:2404.06404 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Apprentices to Research Assistants: Advancing Research with Large Language ModelsSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs) have emerged as powerful tools in various research domains. This article examines their potential through a literature review and firsthand experimentation. While LLMs offer benefits like cost-effectiveness and efficiency, challenges such as prompt tuning, biases, and subjectivity must be addressed. The study presents insights from experiments utilizing LLMs for qualitative analysis, highlighting successes and limitations. Additionally, it discusses strategies for mitigating challenges, such as prompt optimization techniques and leveraging human expertise. This study aligns with the 'LLMs as Research Tools' workshop's focus on integrating LLMs into HCI data work critically and ethically. By addressing both opportunities and challenges, our work contributes to the ongoing dialogue on their responsible application in research.
- [1010] arXiv:2404.06407 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Rethinking How to Evaluate Language Model JailbreakSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) have become increasingly integrated with various applications. To ensure that LLMs do not generate unsafe responses, they are aligned with safeguards that specify what content is restricted. However, such alignment can be bypassed to produce prohibited content using a technique commonly referred to as jailbreak. Different systems have been proposed to perform the jailbreak automatically. These systems rely on evaluation methods to determine whether a jailbreak attempt is successful. However, our analysis reveals that current jailbreak evaluation methods have two limitations. (1) Their objectives lack clarity and do not align with the goal of identifying unsafe responses. (2) They oversimplify the jailbreak result as a binary outcome, successful or not. In this paper, we propose three metrics, safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreak. Additionally, we demonstrate how these metrics correlate with the goal of different malicious actors. To compute these metrics, we introduce a multifaceted approach that extends the natural language generation evaluation method after preprocessing the response. We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems. The benchmark dataset is labeled by three annotators. We compare our multifaceted approach with three existing jailbreak evaluation methods. Experiments demonstrate that our multifaceted evaluation outperforms existing methods, with F1 scores improving on average by 17% compared to existing baselines. Our findings motivate the need to move away from the binary view of the jailbreak problem and incorporate a more comprehensive evaluation to ensure the safety of the language model.
- [1011] arXiv:2404.06418 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Studying the Impact of Latent Representations in Implicit Neural Networks for Scientific Continuous Field ReconstructionWei Xu , Derek Freeman DeSantis , Xihaier Luo , Avish Parmar , Klaus Tan , Balu Nadiga , Yihui Ren , Shinjae YooSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Learning a continuous and reliable representation of physical fields from sparse sampling is challenging and it affects diverse scientific disciplines. In a recent work, we present a novel model called MMGN (Multiplicative and Modulated Gabor Network) with implicit neural networks. In this work, we design additional studies leveraging explainability methods to complement the previous experiments and further enhance the understanding of latent representations generated by the model. The adopted methods are general enough to be leveraged for any latent space inspection. Preliminary results demonstrate the contextual information incorporated in the latent representations and their impact on the model performance. As a work in progress, we will continue to verify our findings and develop novel explainability approaches.
- [1012] arXiv:2404.06423 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Deep Reinforcement Learning-Based Approach for a Single Vehicle Persistent Surveillance Problem with Fuel ConstraintsComments: 6 pagesSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This article presents a deep reinforcement learning-based approach to tackle a persistent surveillance mission requiring a single unmanned aerial vehicle initially stationed at a depot with fuel or time-of-flight constraints to repeatedly visit a set of targets with equal priority. Owing to the vehicle's fuel or time-of-flight constraints, the vehicle must be regularly refueled, or its battery must be recharged at the depot. The objective of the problem is to determine an optimal sequence of visits to the targets that minimizes the maximum time elapsed between successive visits to any target while ensuring that the vehicle never runs out of fuel or charge. We present a deep reinforcement learning algorithm to solve this problem and present the results of numerical experiments that corroborate the effectiveness of this approach in comparison with common-sense greedy heuristics.
- [1013] arXiv:2404.06429 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Magic-Boost: Boost 3D Generation with Mutli-View Conditioned DiffusionFan Yang , Jianfeng Zhang , Yichun Shi , Bowen Chen , Chenxu Zhang , Huichao Zhang , Xiaofeng Yang , Jiashi Feng , Guosheng LinSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Benefiting from the rapid development of 2D diffusion models, 3D content creation has made significant progress recently. One promising solution involves the fine-tuning of pre-trained 2D diffusion models to harness their capacity for producing multi-view images, which are then lifted into accurate 3D models via methods like fast-NeRFs or large reconstruction models. However, as inconsistency still exists and limited generated resolution, the generation results of such methods still lack intricate textures and complex geometries. To solve this problem, we propose Magic-Boost, a multi-view conditioned diffusion model that significantly refines coarse generative results through a brief period of SDS optimization ($\sim15$min). Compared to the previous text or single image based diffusion models, Magic-Boost exhibits a robust capability to generate images with high consistency from pseudo synthesized multi-view images. It provides precise SDS guidance that well aligns with the identity of the input images, enriching the local detail in both geometry and texture of the initial generative results. Extensive experiments show Magic-Boost greatly enhances the coarse inputs and generates high-quality 3D assets with rich geometric and textural details. (Project Page: this https URL )
- [1014] arXiv:2404.06430 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: pfl-research: simulation framework for accelerating research in Private Federated LearningFilip Granqvist , Congzheng Song , Áine Cahill , Rogier van Dalen , Martin Pelikan , Yi Sheng Chan , Xiaojun Feng , Natarajan Krishnaswami , Vojta Jina , Mona ChitnisSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Federated learning (FL) is an emerging machine learning (ML) training paradigm where clients own their data and collaborate to train a global model, without revealing any data to the server and other participants. Researchers commonly perform experiments in a simulation environment to quickly iterate on ideas. However, existing open-source tools do not offer the efficiency required to simulate FL on larger and more realistic FL datasets. We introduce pfl-research, a fast, modular, and easy-to-use Python framework for simulating FL. It supports TensorFlow, PyTorch, and non-neural network models, and is tightly integrated with state-of-the-art privacy algorithms. We study the speed of open-source FL frameworks and show that pfl-research is 7-72$\times$ faster than alternative open-source frameworks on common cross-device setups. Such speedup will significantly boost the productivity of the FL research community and enable testing hypotheses on realistic FL datasets that were previously too resource intensive. We release a suite of benchmarks that evaluates an algorithm's overall performance on a diverse set of realistic scenarios. The code is available on GitHub at this https URL .
- [1015] arXiv:2404.06448 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Automated Federated Pipeline for Parameter-Efficient Fine-Tuning of Large Language ModelsComments: 15 pages, 16 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Recently, there has been a surge in the development of advanced intelligent generative content (AIGC), especially large language models (LLMs). However, for many downstream tasks, it is necessary to fine-tune LLMs using private data. While federated learning offers a promising privacy-preserving solution to LLM fine-tuning, the substantial size of an LLM, combined with high computational and communication demands, makes it hard to apply to downstream tasks. More importantly, private edge servers often possess varying computing and network resources in real-world scenarios, introducing additional complexities to LLM fine-tuning. To tackle these problems, we design and implement an automated federated pipeline, named FedPipe, to fine-tune LLMs with minimal training cost but without adding any inference latency. FedPipe firstly identifies the weights to be fine-tuned based on their contributions to the LLM training. It then configures a low-rank adapter for each selected weight to train local low-rank adapters on an edge server, and aggregate local adapters of all edge servers to fine-tune the whole LLM. Finally, it appropriately quantizes the parameters of LLM to reduce memory space according to the requirements of edge servers. Extensive experiments demonstrate that FedPipe expedites the model training and achieves higher accuracy than state-of-the-art benchmarks.
- [1016] arXiv:2404.06453 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant CircuitsComments: 14 pages (4 pages manuscript, 2 pages references, 8 pages appendix)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The field of mechanistic interpretability aims to study the role of individual neurons in Deep Neural Networks. Single neurons, however, have the capability to act polysemantically and encode for multiple (unrelated) features, which renders their interpretation difficult. We present a method for disentangling polysemanticity of any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic "virtual" neurons. This is achieved by identifying the relevant sub-graph ("circuit") for each "pure" feature. We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet. While evaluating feature visualizations using CLIP, our method effectively disentangles representations, improving upon methods based on neuron activations. Our code is available at this https URL .
- [1017] arXiv:2404.06479 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Text-Based Reasoning About Vector GraphicsComments: Project page: this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: While large multimodal models excel in broad vision-language benchmarks, they often struggle with tasks requiring precise perception of low-level visual details, such as comparing line lengths or solving simple mazes. In particular, this failure mode persists in question-answering tasks about vector graphics -- images composed purely of 2D objects and shapes. To address this challenge, we propose the Visually Descriptive Language Model (VDLM), which performs text-based reasoning about vector graphics. VDLM leverages Scalable Vector Graphics (SVG) for a more precise visual description and first uses an off-the-shelf raster-to-SVG algorithm for encoding. Since existing language models cannot understand raw SVGs in a zero-shot setting, VDLM then bridges SVG with pretrained language models through a newly introduced intermediate symbolic representation, Primal Visual Description (PVD), comprising primitive attributes (e.g., shape, position, measurement) with their corresponding predicted values. PVD is task-agnostic and represents visual primitives that are universal across all vector graphics. It can be learned with procedurally generated (SVG, PVD) pairs and also enables the direct use of LLMs for generalization to complex reasoning tasks. By casting an image to a text-based representation, we can leverage the power of language models to learn alignment from SVG to visual primitives and generalize to unseen question-answering tasks. Empirical results show that VDLM achieves stronger zero-shot performance compared to state-of-the-art LMMs, such as GPT-4V, in various low-level multimodal perception and reasoning tasks on vector graphics. We additionally present extensive analyses on VDLM's performance, demonstrating that our framework offers better interpretability due to its disentangled perception and reasoning processes. Project page: this https URL
- [1018] arXiv:2404.06480 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarksComments: NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets based on open-source datasets, focusing mainly on QA and summarization tasks. These datasets include test samples of varying lengths (from 2k to 32k+) entangled together, making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultralong settings (100k+ tokens) that the latest LLMs claim to achieve. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities. These benchmarks support intricate manipulation of the length of test cases, and can easily produce text samples up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings. Our code is available at this https URL .
- [1019] arXiv:2404.06484 (cross-list from cs.SE) [ pdf , ps , other ]
-
Title: Public-private funding models in open source software development: A case study on scikit-learnComments: 15 pagesSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Governments are increasingly funding open source software (OSS) development to support software security, digital sovereignty, and national competitiveness in science and innovation, amongst others. However, little is known about how OSS developers evaluate the relative benefits and drawbacks of governmental funding for OSS. This study explores this question through a case study on scikit-learn, a Python library for machine learning, funded by public research grants, commercial sponsorship, micro-donations, and a 32 euro million grant announced in France's artificial intelligence strategy. Through 25 interviews with scikit-learn's maintainers and funders, this study makes two key contributions. First, it contributes empirical findings about the benefits and drawbacks of public and private funding in an impactful OSS project, and the governance protocols employed by the maintainers to balance the diverse interests of their community and funders. Second, it offers practical lessons on funding for OSS developers, governments, and companies based on the experience of scikit-learn. The paper concludes with key recommendations for practitioners and future research directions.
- [1020] arXiv:2404.06488 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Pitfalls of Conversational LLMs on News DebiasingComments: The paper is accepted at the DELITE workshop which is co-located at COLING/LRECSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper addresses debiasing in news editing and evaluates the effectiveness of conversational Large Language Models in this task. We designed an evaluation checklist tailored to news editors' perspectives, obtained generated texts from three popular conversational models using a subset of a publicly available dataset in media bias, and evaluated the texts according to the designed checklist. Furthermore, we examined the models as evaluator for checking the quality of debiased model outputs. Our findings indicate that none of the LLMs are perfect in debiasing. Notably, some models, including ChatGPT, introduced unnecessary changes that may impact the author's style and create misinformation. Lastly, we show that the models do not perform as proficiently as domain experts in evaluating the quality of debiased outputs.
- [1021] arXiv:2404.06492 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Graph Reinforcement Learning for Combinatorial Optimization: A Survey and Unifying PerspectiveSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graphs are a natural representation for systems based on relations between connected entities. Combinatorial optimization problems, which arise when considering an objective function related to a process of interest on discrete structures, are often challenging due to the rapid growth of the solution space. The trial-and-error paradigm of Reinforcement Learning has recently emerged as a promising alternative to traditional methods, such as exact algorithms and (meta)heuristics, for discovering better decision-making strategies in a variety of disciplines including chemistry, computer science, and statistics. Despite the fact that they arose in markedly different fields, these techniques share significant commonalities. Therefore, we set out to synthesize this work in a unifying perspective that we term Graph Reinforcement Learning, interpreting it as a constructive decision-making method for graph problems. After covering the relevant technical background, we review works along the dividing line of whether the goal is to optimize graph structure given a process of interest, or to optimize the outcome of the process itself under fixed graph structure. Finally, we discuss the common challenges facing the field and open research questions. In contrast with other surveys, the present work focuses on non-canonical graph problems for which performant algorithms are typically not known and Reinforcement Learning is able to provide efficient and effective solutions.
- [1022] arXiv:2404.06511 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MoReVQA: Exploring Modular Reasoning Models for Video Question AnsweringComments: CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
- [1023] arXiv:2404.06519 (cross-list from cs.GT) [ pdf , ps , other ]
-
Title: Best Response ShapingSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: We investigate the challenge of multi-agent deep reinforcement learning in partially competitive environments, where traditional methods struggle to foster reciprocity-based cooperation. LOLA and POLA agents learn reciprocity-based cooperative policies by differentiation through a few look-ahead optimization steps of their opponent. However, there is a key limitation in these techniques. Because they consider a few optimization steps, a learning opponent that takes many steps to optimize its return may exploit them. In response, we introduce a novel approach, Best Response Shaping (BRS), which differentiates through an opponent approximating the best response, termed the "detective." To condition the detective on the agent's policy for complex games we propose a state-aware differentiable conditioning mechanism, facilitated by a question answering (QA) method that extracts a representation of the agent based on its behaviour on specific environment states. To empirically validate our method, we showcase its enhanced performance against a Monte Carlo Tree Search (MCTS) opponent, which serves as an approximation to the best response in the Coin Game. This work expands the applicability of multi-agent RL in partially competitive environments and provides a new pathway towards achieving improved social welfare in general sum games.
- [1024] arXiv:2404.06524 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: An Enhanced Grey Wolf Optimizer with Elite Inheritance and Balance Search MechanismsComments: 51 pages, 21 tables, 16 figures, journalSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: The Grey Wolf Optimizer (GWO) is recognized as a novel meta-heuristic algorithm inspired by the social leadership hierarchy and hunting mechanism of grey wolves. It is well-known for its simple parameter setting, fast convergence speed, and strong optimization capability. In the original GWO, there are two significant design flaws in its fundamental optimization mechanisms. Problem (1): the algorithm fails to inherit from elite positions from the last iteration when generating the next positions of the wolf population, potentially leading to suboptimal solutions. Problem (2): the positions of the population are updated based on the central position of the three leading wolves (alpha, beta, delta), without a balanced mechanism between local and global search. To tackle these problems, an enhanced Grey Wolf Optimizer with Elite Inheritance Mechanism and Balance Search Mechanism, named as EBGWO, is proposed to improve the effectiveness of the position updating and the quality of the convergence solutions. The IEEE CEC 2014 benchmark functions suite and a series of simulation tests are employed to evaluate the performance of the proposed algorithm. The simulation tests involve a comparative study between EBGWO, three GWO variants, GWO and two well-known meta-heuristic algorithms. The experimental results demonstrate that the proposed EBGWO algorithm outperforms other meta-heuristic algorithms in both accuracy and convergence speed. Three engineering optimization problems are adopted to prove its capability in processing real-world problems. The results indicate that the proposed EBGWO outperforms several popular algorithms.
- [1025] arXiv:2404.06529 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Emergent Braitenberg-style Behaviours for Navigating the ViZDoom `My Way Home' LabyrinthSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: The navigation of complex labyrinths with tens of rooms under visual partially observable state is typically addressed using recurrent deep reinforcement learning architectures. In this work, we show that navigation can be achieved through the emergent evolution of a simple Braitentberg-style heuristic that structures the interaction between agent and labyrinth, i.e. complex behaviour from simple heuristics. To do so, the approach of tangled program graphs is assumed in which programs cooperatively coevolve to develop a modular indexing scheme that only employs 0.8\% of the state space. We attribute this simplicity to several biases implicit in the representation, such as the use of pixel indexing as opposed to deploying a convolutional kernel or image processing operators.
- [1026] arXiv:2404.06561 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Learning Strategies For Successful Crowd NavigationComments: 8 pagesSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: Teaching autonomous mobile robots to successfully navigate human crowds is a challenging task. Not only does it require planning, but it requires maintaining social norms which may differ from one context to another. Here we focus on crowd navigation, using a neural network to learn specific strategies in-situ with a robot. This allows us to take into account human behavior and reactions toward a real robot as well as learn strategies that are specific to various scenarios in that context. A CNN takes a top-down image of the scene as input and outputs the next action for the robot to take in terms of speed and angle. Here we present the method, experimental results, and quantitatively evaluate our approach.
- [1027] arXiv:2404.06579 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Less is More for Improving Automatic Evaluation of Factual ConsistencyComments: Accepted in NAACL24 Industry; 7 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Assessing the factual consistency of automatically generated texts in relation to source context is crucial for developing reliable natural language generation applications. Recent literature proposes AlignScore which uses a unified alignment model to evaluate factual consistency and substantially outperforms previous methods across many benchmark tasks. In this paper, we take a closer look of datasets used in AlignScore and uncover an unexpected finding: utilizing a smaller number of data points can actually improve performance. We process the original AlignScore training dataset to remove noise, augment with robustness-enhanced samples, and utilize a subset comprising 10\% of the data to train an improved factual consistency evaluation model, we call LIM-RA (Less Is More for Robust AlignScore). LIM-RA demonstrates superior performance, consistently outperforming AlignScore and other strong baselines like ChatGPT across four benchmarks (two utilizing traditional natural language generation datasets and two focused on large language model outputs). Our experiments show that LIM-RA achieves the highest score on 24 of the 33 test datasets, while staying competitive on the rest, establishing the new state-of-the-art benchmarks.
- [1028] arXiv:2404.06593 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Spatially Optimized Compact Deep Metric Learning Model for Similarity SearchMd. Farhadul Islam , Md. Tanzim Reza , Meem Arafat Manab , Mohammad Rakibul Hasan Mahin , Sarah Zabeen , Jannatun NoorComments: 5 pages, 3 figures,Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Spatial optimization is often overlooked in many computer vision tasks. Filters should be able to recognize the features of an object regardless of where it is in the image. Similarity search is a crucial task where spatial features decide an important output. The capacity of convolution to capture visual patterns across various locations is limited. In contrast to convolution, the involution kernel is dynamically created at each pixel based on the pixel value and parameters that have been learned. This study demonstrates that utilizing a single layer of involution feature extractor alongside a compact convolution model significantly enhances the performance of similarity search. Additionally, we improve predictions by using the GELU activation function rather than the ReLU. The negligible amount of weight parameters in involution with a compact model with better performance makes the model very useful in real-world implementations. Our proposed model is below 1 megabyte in size. We have experimented with our proposed methodology and other models on CIFAR-10, FashionMNIST, and MNIST datasets. Our proposed method outperforms across all three datasets.
- [1029] arXiv:2404.06599 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FMDA-OT: Federated Multi-source Domain Adaptation Through Optimal TransportSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Multi-source Domain Adaptation (MDA) aims to adapt models trained on multiple labeled source domains to an unlabeled target domain. In this paper, we introduce our approach as a collaborative MDA framework, which comprises two adaptation phases. Firstly, we conduct domain adaptation for each source individually with the target, utilizing optimal transport. Then, in the second phase, which constitutes the final part of the framework, we design the architecture of centralized federated learning to collaborate the N models representing the N sources. This architecture offers the advantage of using the sources without accessing their data, thus resolving data privacy issues inherent in domain adaptation. Additionally, during this phase, the server guides and fine-tunes the adaptation using a small number of pseudo-labeled samples available in the target domain, referred to as the target validation subset of the dataset.
- [1030] arXiv:2404.06631 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Counting Objects in a Robotic HandSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: A robot performing multi-object grasping needs to sense the number of objects in the hand after grasping. The count plays an important role in determining the robot's next move and the outcome and efficiency of the whole pick-place process. This paper presents a data-driven contrastive learning-based counting classifier with a modified loss function as a simple and effective approach for object counting despite significant occlusion challenges caused by robotic fingers and objects. The model was validated against other models with three different common shapes (spheres, cylinders, and cubes) in simulation and in a real setup. The proposed contrastive learning-based counting approach achieved above 96\% accuracy for all three objects in the real setup.
- [1031] arXiv:2404.06633 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Evolving Loss Functions for Specific Image Augmentation TechniquesSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Previous work in Neural Loss Function Search (NLFS) has shown a lack of correlation between smaller surrogate functions and large convolutional neural networks with massive regularization. We expand upon this research by revealing another disparity that exists, correlation between different types of image augmentation techniques. We show that different loss functions can perform well on certain image augmentation techniques, while performing poorly on others. We exploit this disparity by performing an evolutionary search on five types of image augmentation techniques in the hopes of finding image augmentation specific loss functions. The best loss functions from each evolution were then taken and transferred to WideResNet-28-10 on CIFAR-10 and CIFAR-100 across each of the five image augmentation techniques. The best from that were then taken and evaluated by fine-tuning EfficientNetV2Small on the CARS, Oxford-Flowers, and Caltech datasets across each of the five image augmentation techniques. Multiple loss functions were found that outperformed cross-entropy across multiple experiments. In the end, we found a single loss function, which we called the inverse bessel logarithm loss, that was able to outperform cross-entropy across the majority of experiments.
- [1032] arXiv:2404.06641 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Federated learning model for predicting major postoperative complicationsYonggi Park , Yuanfang Ren , Benjamin Shickel , Ziyuan Guan , Ayush Patela , Yingbo Ma , Zhenhong Hu , Tyler J. Loftus , Parisa Rashidi , Tezcan Ozrazgat-Baslanti , Azra BihoracComments: 57 pages. 2 figures, 3 tables, 2 supplemental figures, 8 supplemental tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Background: The accurate prediction of postoperative complication risk using Electronic Health Records (EHR) and artificial intelligence shows great potential. Training a robust artificial intelligence model typically requires large-scale and diverse datasets. In reality, collecting medical data often encounters challenges surrounding privacy protection. Methods: This retrospective cohort study includes adult patients who were admitted to UFH Gainesville (GNV) (n = 79,850) and Jacksonville (JAX) (n = 28,636) for any type of inpatient surgical procedure. Using perioperative and intraoperative features, we developed federated learning models to predict nine major postoperative complications (i.e., prolonged intensive care unit stay and mechanical ventilation). We compared federated learning models with local learning models trained on a single site and central learning models trained on pooled dataset from two centers. Results: Our federated learning models achieved the area under the receiver operating characteristics curve (AUROC) values ranged from 0.81 for wound complications to 0.92 for prolonged ICU stay at UFH GNV center. At UFH JAX center, these values ranged from 0.73-0.74 for wound complications to 0.92-0.93 for hospital mortality. Federated learning models achieved comparable AUROC performance to central learning models, except for prolonged ICU stay, where the performance of federated learning models was slightly higher than central learning models at UFH GNV center, but slightly lower at UFH JAX center. In addition, our federated learning model obtained comparable performance to the best local learning model at each center, demonstrating strong generalizability. Conclusion: Federated learning is shown to be a useful tool to train robust and generalizable models from large scale data across multiple institutions where data protection barriers are high.
- [1033] arXiv:2404.06644 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?Omid Ghahroodi , Marzia Nouri , Mohammad Vali Sanian , Alireza Sahebi , Doratossadat Dastgheib , Ehsaneddin Asgari , Mahdieh Soleymani Baghshah , Mohammad Hossein RohbanSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Evaluating Large Language Models (LLMs) is challenging due to their generative nature, necessitating precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, resulting in the absence or weakness of LLMs for many languages. In response to this necessity, we introduce Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection comprising 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages. The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language. Distinctive features of the Khayyam Challenge are (i) its comprehensive coverage of various topics, including literary comprehension, mathematics, sciences, logic, intelligence testing, etc., aimed at assessing different facets of LLMs such as language comprehension, reasoning, and information retrieval across various educational stages, from lower primary school to upper secondary school (ii) its inclusion of rich metadata such as human response rates, difficulty levels, and descriptive answers (iii) its utilization of new data to avoid data contamination issues prevalent in existing frameworks (iv) its use of original, non-translated data tailored for Persian speakers, ensuring the framework is free from translation challenges and errors while encompassing cultural nuances (v) its inherent scalability for future data updates and evaluations without requiring special human effort. Previous works lacked an evaluation framework that combined all of these features into a single comprehensive benchmark. Furthermore, we evaluate a wide range of existing LLMs that support the Persian language, with statistical analyses and interpretations of their outputs.
- [1034] arXiv:2404.06645 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: GenCHiP: Generating Robot Policy Code for High-Precision and Contact-Rich Manipulation TasksComments: 14 pages, 12 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have been successful at generating robot policy code, but so far these results have been limited to high-level tasks that do not require precise movement. It is an open question how well such approaches work for tasks that require reasoning over contact forces and working within tight success tolerances. We find that, with the right action space, LLMs are capable of successfully generating policies for a variety of contact-rich and high-precision manipulation tasks, even under noisy conditions, such as perceptual errors or grasping inaccuracies. Specifically, we reparameterize the action space to include compliance with constraints on the interaction forces and stiffnesses involved in reaching a target pose. We validate this approach on subtasks derived from the Functional Manipulation Benchmark (FMB) and NIST Task Board Benchmarks. Exposing this action space alongside methods for estimating object poses improves policy generation with an LLM by greater than 3x and 4x when compared to non-compliant action spaces
- [1035] arXiv:2404.06647 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: From Protoscience to Epistemic Monoculture: How Benchmarking Set the Stage for the Deep Learning RevolutionSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Over the past decade, AI research has focused heavily on building ever-larger deep learning models. This approach has simultaneously unlocked incredible achievements in science and technology, and hindered AI from overcoming long-standing limitations with respect to explainability, ethical harms, and environmental efficiency. Drawing on qualitative interviews and computational analyses, our three-part history of AI research traces the creation of this "epistemic monoculture" back to a radical reconceptualization of scientific progress that began in the late 1980s. In the first era of AI research (1950s-late 1980s), researchers and patrons approached AI as a "basic" science that would advance through autonomous exploration and organic assessments of progress (e.g., peer-review, theoretical consensus). The failure of this approach led to a retrenchment of funding in the 1980s. Amid this "AI Winter," an intervention by the U.S. government reoriented the field towards measurable progress on tasks of military and commercial interest. A new evaluation system called "benchmarking" provided an objective way to quantify progress on tasks by focusing exclusively on increasing predictive accuracy on example datasets. Distilling science down to verifiable metrics clarified the roles of scientists, allowed the field to rapidly integrate talent, and provided clear signals of significance and progress. But history has also revealed a tradeoff to this streamlined approach to science: the consolidation around external interests and inherent conservatism of benchmarking has disincentivized exploration beyond scaling monoculture. In the discussion, we explain how AI's monoculture offers a compelling challenge to the belief that basic, exploration-driven research is needed for scientific progress. Implications for the spread of AI monoculture to other sciences in the era of generative AI are also discussed.
- [1036] arXiv:2404.06664 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural KnowledgeYu Ying Chiu , Liwei Jiang , Maria Antoniak , Chan Young Park , Shuyue Stella Li , Mehar Bhatia , Sahithya Ravi , Yulia Tsvetkov , Vered Shwartz , Yejin ChoiComments: Preprint (under review)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Frontier large language models (LLMs) are developed by researchers and practitioners with skewed cultural backgrounds and on datasets with skewed sources. However, LLMs' (lack of) multicultural knowledge cannot be effectively assessed with current methods for developing benchmarks. Existing multicultural evaluations primarily rely on expensive and restricted human annotations or potentially outdated internet resources. Thus, they struggle to capture the intricacy, dynamics, and diversity of cultural norms. LLM-generated benchmarks are promising, yet risk propagating the same biases they are meant to measure. To synergize the creativity and expert cultural knowledge of human annotators and the scalability and standardizability of LLM-based automation, we introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build truly challenging evaluation dataset for assessing the multicultural knowledge of LLMs, while improving annotators' capabilities and experiences. Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions, that modern LLMs fail at, in a gamified manner. Importantly, the increased level of AI assistance (e.g., LLM-generated revision hints) empowers users to create more difficult questions with enhanced perceived creativity of themselves, shedding light on the promises of involving heavier AI assistance in modern evaluation dataset creation procedures. Through a series of 1-hour workshop sessions, we gather CULTURALBENCH-V0.1, a compact yet high-quality evaluation dataset with users' red-teaming attempts, that different families of modern LLMs perform with accuracy ranging from 37.7% to 72.2%, revealing a notable gap in LLMs' multicultural proficiency.
- [1037] arXiv:2404.06666 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SafeGen: Mitigating Unsafe Content Generation in Text-to-Image ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Abstract: Text-to-image (T2I) models, such as Stable Diffusion, have exhibited remarkable performance in generating high-quality images from text descriptions in recent years. However, text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexual scenarios. Existing countermeasures mostly focus on filtering inappropriate inputs and outputs, or suppressing improper text embeddings, which can block explicit NSFW-related content (e.g., naked or sexy) but may still be vulnerable to adversarial prompts inputs that appear innocent but are ill-intended. In this paper, we present SafeGen, a framework to mitigate unsafe content generation by text-to-image models in a text-agnostic manner. The key idea is to eliminate unsafe visual representations from the model regardless of the text input. In this way, the text-to-image model is resistant to adversarial prompts since unsafe visual representations are obstructed from within. Extensive experiments conducted on four datasets demonstrate SafeGen's effectiveness in mitigating unsafe content generation while preserving the high-fidelity of benign images. SafeGen outperforms eight state-of-the-art baseline methods and achieves 99.1% sexual content removal performance. Furthermore, our constructed benchmark of adversarial prompts provides a basis for future development and evaluation of anti-NSFW-generation methods.
- [1038] arXiv:2404.06668 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Forecasting the Future with Future Technologies: Advancements in Large Meteorological ModelsComments: 5 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Abstract: The field of meteorological forecasting has undergone a significant transformation with the integration of large models, especially those employing deep learning techniques. This paper reviews the advancements and applications of these models in weather prediction, emphasizing their role in transforming traditional forecasting methods. Models like FourCastNet, Pangu-Weather, GraphCast, ClimaX, and FengWu have made notable contributions by providing accurate, high-resolution forecasts, surpassing the capabilities of traditional Numerical Weather Prediction (NWP) models. These models utilize advanced neural network architectures, such as Convolutional Neural Networks (CNNs), Graph Neural Networks (GNNs), and Transformers, to process diverse meteorological data, enhancing predictive accuracy across various time scales and spatial resolutions. The paper addresses challenges in this domain, including data acquisition and computational demands, and explores future opportunities for model optimization and hardware advancements. It underscores the integration of artificial intelligence with conventional meteorological techniques, promising improved weather prediction accuracy and a significant contribution to addressing climate-related challenges. This synergy positions large models as pivotal in the evolving landscape of meteorological forecasting.
- [1039] arXiv:2404.06674 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice EditingPhilip Anastassiou , Zhenyu Tang , Kainan Peng , Dongya Jia , Jiaxin Li , Ming Tu , Yuping Wang , Yuxuan Wang , Mingbo MaSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can only edit these attributes individually and suffer from the following pitfalls: the magnitude of the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit undesirable timbre leakage. Our work proposes solutions for each of these issues in a simple modular framework based on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules, whose components can be combined or removed during inference to meet a wide array of tasks without additional model finetuning. Audio samples are available at \url{ this https URL }.
- [1040] arXiv:2404.06679 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Neural Optimizer Equation, Decay Function, and Learning Rate Schedule Joint EvolutionSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: A major contributor to the quality of a deep learning model is the selection of the optimizer. We propose a new dual-joint search space in the realm of neural optimizer search (NOS), along with an integrity check, to automate the process of finding deep learning optimizers. Our dual-joint search space simultaneously allows for the optimization of not only the update equation, but also internal decay functions and learning rate schedules for optimizers. We search the space using our proposed mutation-only, particle-based genetic algorithm able to be massively parallelized for our domain-specific problem. We evaluate our candidate optimizers on the CIFAR-10 dataset using a small ConvNet. To assess generalization, the final optimizers were then transferred to large-scale image classification on CIFAR- 100 and TinyImageNet, while also being fine-tuned on Flowers102, Cars196, and Caltech101 using EfficientNetV2Small. We found multiple optimizers, learning rate schedules, and Adam variants that outperformed Adam, as well as other standard deep learning optimizers, across the image classification tasks.
- [1041] arXiv:2404.06690 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker ConversationsLeying Zhang , Yao Qian , Long Zhou , Shujie Liu , Dongmei Wang , Xiaofei Wang , Midia Yousefi , Yanmin Qian , Jinyu Li , Lei He , Sheng Zhao , Michael ZengSubjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Abstract: Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge in the field. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix is capable of first converting dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers. These token streams are then fed into a flow-matching based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model. Furthermore, we devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation. Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation. These dialogues, generated within a single channel, are characterized by seamless speech transitions, including overlapping speech, and appropriate paralinguistic behaviors such as laughter. Audio samples are available at this https URL .
- [1042] arXiv:2404.06694 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: How to Craft Backdoors with Unlabeled Data Alone?Comments: Accepted at ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models (DPFM)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: Relying only on unlabeled data, Self-supervised learning (SSL) can learn rich features in an economical and scalable way. As the drive-horse for building foundation models, SSL has received a lot of attention recently with wide applications, which also raises security concerns where backdoor attack is a major type of threat: if the released dataset is maliciously poisoned, backdoored SSL models can behave badly when triggers are injected to test samples. The goal of this work is to investigate this potential risk. We notice that existing backdoors all require a considerable amount of \emph{labeled} data that may not be available for SSL. To circumvent this limitation, we explore a more restrictive setting called no-label backdoors, where we only have access to the unlabeled data alone, where the key challenge is how to select the proper poison set without using label information. We propose two strategies for poison selection: clustering-based selection using pseudolabels, and contrastive selection derived from the mutual information principle. Experiments on CIFAR-10 and ImageNet-100 show that both no-label backdoors are effective on many SSL methods and outperform random poisoning by a large margin. Code will be available at this https URL .
- [1043] arXiv:2404.06704 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Convolution-based Probability Gradient Loss for Semantic SegmentationComments: 12 pages, 7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we introduce a novel Convolution-based Probability Gradient (CPG) loss for semantic segmentation. It employs convolution kernels similar to the Sobel operator, capable of computing the gradient of pixel intensity in an image. This enables the computation of gradients for both ground-truth and predicted category-wise probabilities. It enhances network performance by maximizing the similarity between these two probability gradients. Moreover, to specifically enhance accuracy near the object's boundary, we extract the object boundary based on the ground-truth probability gradient and exclusively apply the CPG loss to pixels belonging to boundaries. CPG loss proves to be highly convenient and effective. It establishes pixel relationships through convolution, calculating errors from a distinct dimension compared to pixel-wise loss functions such as cross-entropy loss. We conduct qualitative and quantitative analyses to evaluate the impact of the CPG loss on three well-established networks (DeepLabv3-Resnet50, HRNetV2-OCR, and LRASPP_MobileNet_V3_Large) across three standard segmentation datasets (Cityscapes, COCO-Stuff, ADE20K). Our extensive experimental results consistently and significantly demonstrate that the CPG loss enhances the mean Intersection over Union.
- [1044] arXiv:2404.06710 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SpikeNVS: Enhancing Novel View Synthesis from Blurry Images via Spike CameraGaole Dai , Zhenyu Wang , Qinwen Xu , Ming Lu , Wen Chen , Boxin Shi , Shanghang Zhang , Tiejun HuangSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: One of the most critical factors in achieving sharp Novel View Synthesis (NVS) using neural field methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) is the quality of the training images. However, Conventional RGB cameras are susceptible to motion blur. In contrast, neuromorphic cameras like event and spike cameras inherently capture more comprehensive temporal information, which can provide a sharp representation of the scene as additional training data. Recent methods have explored the integration of event cameras to improve the quality of NVS. The event-RGB approaches have some limitations, such as high training costs and the inability to work effectively in the background. Instead, our study introduces a new method that uses the spike camera to overcome these limitations. By considering texture reconstruction from spike streams as ground truth, we design the Texture from Spike (TfS) loss. Since the spike camera relies on temporal integration instead of temporal differentiation used by event cameras, our proposed TfS loss maintains manageable training costs. It handles foreground objects with backgrounds simultaneously. We also provide a real-world dataset captured with our spike-RGB camera system to facilitate future research endeavors. We conduct extensive experiments using synthetic and real-world datasets to demonstrate that our design can enhance novel view synthesis across NeRF and 3DGS. The code and dataset will be made available for public access.
- [1045] arXiv:2404.06717 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Racial/Ethnic Categories in AI and Algorithmic Fairness: Why They Matter and What They RepresentSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Racial diversity has become increasingly discussed within the AI and algorithmic fairness literature, yet little attention is focused on justifying the choices of racial categories and understanding how people are racialized into these chosen racial categories. Even less attention is given to how racial categories shift and how the racialization process changes depending on the context of a dataset or model. An unclear understanding of \textit{who} comprises the racial categories chosen and \textit{how} people are racialized into these categories can lead to varying interpretations of these categories. These varying interpretations can lead to harm when the understanding of racial categories and the racialization process is misaligned from the actual racialization process and racial categories used. Harm can also arise if the racialization process and racial categories used are irrelevant or do not exist in the context they are applied.
In this paper, we make two contributions. First, we demonstrate how racial categories with unclear assumptions and little justification can lead to varying datasets that poorly represent groups obfuscated or unrepresented by the given racial categories and models that perform poorly on these groups. Second, we develop a framework, CIRCSheets, for documenting the choices and assumptions in choosing racial categories and the process of racialization into these categories to facilitate transparency in understanding the processes and assumptions made by dataset or model developers when selecting or using these racial categories. - [1046] arXiv:2404.06731 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Accuracy of a Large Language Model in Distinguishing Anti- And Pro-vaccination Messages on Social Media: The Case of Human Papillomavirus VaccinationComments: Forthcoming in Preventive Medicine ReportsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Objective. Vaccination has engendered a spectrum of public opinions, with social media acting as a crucial platform for health-related discussions. The emergence of artificial intelligence technologies, such as large language models (LLMs), offers a novel opportunity to efficiently investigate public discourses. This research assesses the accuracy of ChatGPT, a widely used and freely available service built upon an LLM, for sentiment analysis to discern different stances toward Human Papillomavirus (HPV) vaccination. Methods. Messages related to HPV vaccination were collected from social media supporting different message formats: Facebook (long format) and Twitter (short format). A selection of 1,000 human-evaluated messages was input into the LLM, which generated multiple response instances containing its classification results. Accuracy was measured for each message as the level of concurrence between human and machine decisions, ranging between 0 and 1. Results. Average accuracy was notably high when 20 response instances were used to determine the machine decision of each message: .882 (SE = .021) and .750 (SE = .029) for anti- and pro-vaccination long-form; .773 (SE = .027) and .723 (SE = .029) for anti- and pro-vaccination short-form, respectively. Using only three or even one instance did not lead to a severe decrease in accuracy. However, for long-form messages, the language model exhibited significantly lower accuracy in categorizing pro-vaccination messages than anti-vaccination ones. Conclusions. ChatGPT shows potential in analyzing public opinions on HPV vaccination using social media content. However, understanding the characteristics and limitations of a language model within specific public health contexts remains imperative.
- [1047] arXiv:2404.06733 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Incremental XAI: Memorable Understanding of AI with Incremental ExplanationsComments: CHI 2024Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Many explainable AI (XAI) techniques strive for interpretability by providing concise salient information, such as sparse linear factors. However, users either only see inaccurate global explanations, or highly-varying local explanations. We propose to provide more detailed explanations by leveraging the human cognitive capacity to accumulate knowledge by incrementally receiving more details. Focusing on linear factor explanations (factors $\times$ values = outcome), we introduce Incremental XAI to automatically partition explanations for general and atypical instances by providing Base + Incremental factors to help users read and remember more faithful explanations. Memorability is improved by reusing base factors and reducing the number of factors shown in atypical cases. In modeling, formative, and summative user studies, we evaluated the faithfulness, memorability and understandability of Incremental XAI against baseline explanation methods. This work contributes towards more usable explanation that users can better ingrain to facilitate intuitive engagement with AI.
- [1048] arXiv:2404.06750 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Frontier AI Ethics: Anticipating and Evaluating the Societal Impacts of Generative AgentsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Some have criticised Generative AI Systems for replicating the familiar pathologies of already widely-deployed AI systems. Other critics highlight how they foreshadow vastly more powerful future systems, which might threaten humanity's survival. The first group says there is nothing new here; the other looks through the present to a perhaps distant horizon. In this paper, I instead pay attention to what makes these particular systems distinctive: both their remarkable scientific achievement, and the most likely and consequential ways in which they will change society over the next five to ten years. In particular, I explore the potential societal impacts and normative questions raised by the looming prospect of 'Generative Agents', in which multimodal large language models (LLMs) form the executive centre of complex, tool-using AI systems that can take unsupervised sequences of actions towards some goal.
- [1049] arXiv:2404.06756 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: CrimeAlarm: Towards Intensive Intent Dynamics in Fine-grained Crime PredictionComments: Accepted by DASFAA 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Granularity and accuracy are two crucial factors for crime event prediction. Within fine-grained event classification, multiple criminal intents may alternately exhibit in preceding sequential events, and progress differently in next. Such intensive intent dynamics makes training models hard to capture unobserved intents, and thus leads to sub-optimal generalization performance, especially in the intertwining of numerous potential events. To capture comprehensive criminal intents, this paper proposes a fine-grained sequential crime prediction framework, CrimeAlarm, that equips with a novel mutual distillation strategy inspired by curriculum learning. During the early training phase, spot-shared criminal intents are captured through high-confidence sequence samples. In the later phase, spot-specific intents are gradually learned by increasing the contribution of low-confidence sequences. Meanwhile, the output probability distributions are reciprocally learned between prediction networks to model unobserved criminal intents. Extensive experiments show that CrimeAlarm outperforms state-of-the-art methods in terms of NDCG@5, with improvements of 4.51% for the NYC16 and 7.73% for the CHI18 in accuracy measures.
- [1050] arXiv:2404.06757 (cross-list from cs.DS) [ pdf , ps , html , other ]
-
Title: Language Generation in the LimitComments: 24 pages, 2 figuresSubjects: Data Structures and Algorithms (cs.DS) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Although current large language models are complex, the most basic specifications of the underlying language generation problem itself are simple to state: given a finite set of training samples from an unknown language, produce valid new strings from the language that don't already appear in the training data. Here we ask what we can conclude about language generation using only this specification, without further assumptions. In particular, suppose that an adversary enumerates the strings of an unknown target language L that is known only to come from one of a possibly infinite list of candidates. A computational agent is trying to learn to generate from this language; we say that the agent generates from L in the limit if after some finite point in the enumeration of L, the agent is able to produce new elements that come exclusively from L and that have not yet been presented by the adversary. Our main result is that there is an agent that is able to generate in the limit for every countable list of candidate languages. This contrasts dramatically with negative results due to Gold and Angluin in a well-studied model of language learning where the goal is to identify an unknown language from samples; the difference between these results suggests that identifying a language is a fundamentally different problem than generating from it.
- [1051] arXiv:2404.06760 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DiffusionDialog: A Diffusion Model for Diverse Dialog Generation with Latent SpaceComments: LREC-COLING 2024 camera readySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In real-life conversations, the content is diverse, and there exists the one-to-many problem that requires diverse generation. Previous studies attempted to introduce discrete or Gaussian-based continuous latent variables to address the one-to-many problem, but the diversity is limited. Recently, diffusion models have made breakthroughs in computer vision, and some attempts have been made in natural language processing. In this paper, we propose DiffusionDialog, a novel approach to enhance the diversity of dialogue generation with the help of diffusion model. In our approach, we introduce continuous latent variables into the diffusion model. The problem of using latent variables in the dialog task is how to build both an effective prior of the latent space and an inferring process to obtain the proper latent given the context. By combining the encoder and latent-based diffusion model, we encode the response's latent representation in a continuous space as the prior, instead of fixed Gaussian distribution or simply discrete ones. We then infer the latent by denoising step by step with the diffusion model. The experimental results show that our model greatly enhances the diversity of dialog responses while maintaining coherence. Furthermore, in further analysis, we find that our diffusion model achieves high inference efficiency, which is the main challenge of applying diffusion models in natural language processing.
- [1052] arXiv:2404.06776 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Logit Calibration and Feature Contrast for Robust Federated Learning on Non-IID DataSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Federated learning (FL) is a privacy-preserving distributed framework for collaborative model training on devices in edge networks. However, challenges arise due to vulnerability to adversarial examples (AEs) and the non-independent and identically distributed (non-IID) nature of data distribution among devices, hindering the deployment of adversarially robust and accurate learning models at the edge. While adversarial training (AT) is commonly acknowledged as an effective defense strategy against adversarial attacks in centralized training, we shed light on the adverse effects of directly applying AT in FL that can severely compromise accuracy, especially in non-IID challenges. Given this limitation, this paper proposes FatCC, which incorporates local logit \underline{C}alibration and global feature \underline{C}ontrast into the vanilla federated adversarial training (\underline{FAT}) process from both logit and feature perspectives. This approach can effectively enhance the federated system's robust accuracy (RA) and clean accuracy (CA). First, we propose logit calibration, where the logits are calibrated during local adversarial updates, thereby improving adversarial robustness. Second, FatCC introduces feature contrast, which involves a global alignment term that aligns each local representation with unbiased global features, thus further enhancing robustness and accuracy in federated adversarial environments. Extensive experiments across multiple datasets demonstrate that FatCC achieves comparable or superior performance gains in both CA and RA compared to other baselines.
- [1053] arXiv:2404.06787 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Private Wasserstein Distance with Random NoisesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Wasserstein distance is a principle measure of data divergence from a distributional standpoint. However, its application becomes challenging in the context of data privacy, where sharing raw data is restricted. Prior attempts have employed techniques like Differential Privacy or Federated optimization to approximate Wasserstein distance. Nevertheless, these approaches often lack accuracy and robustness against potential attack. In this study, we investigate the underlying triangular properties within the Wasserstein space, leading to a straightforward solution named TriangleWad. This approach enables the computation of Wasserstein distance between datasets stored across different entities. Notably, TriangleWad is 20 times faster, making raw data information truly invisible, enhancing resilience against attacks, and without sacrificing estimation accuracy. Through comprehensive experimentation across various tasks involving both image and text data, we demonstrate its superior performance and generalizations.
- [1054] arXiv:2404.06828 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Proposed modified computational model for the amoeba-inspired combinatorial optimization machineComments: 13 pages, 2 figuresSubjects: Neural and Evolutionary Computing (cs.NE) ; Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Chaotic Dynamics (nlin.CD); Computation (stat.CO)
Abstract: A single-celled amoeba can solve the traveling salesman problem through its shape-changing dynamics. In this paper, we examine roles of several elements in a previously proposed computational model of the solution-search process of amoeba and three modifications towards enhancing the solution-search preformance. We find that appropriate modifications can indeed significantly improve the quality of solutions. It is also found that a condition associated with the volume conservation can also be modified in contrast to the naive belief that it is indispensable for the solution-search ability of amoeba. A proposed modified model shows much better performance.
- [1055] arXiv:2404.06859 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Multi-Label Continual Learning for the Medical Domain: A Novel BenchmarkSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Multi-label image classification in dynamic environments is a problem that poses significant challenges. Previous studies have primarily focused on scenarios such as Domain Incremental Learning and Class Incremental Learning, which do not fully capture the complexity of real-world applications. In this paper, we study the problem of classification of medical imaging in the scenario termed New Instances and New Classes, which combines the challenges of both new class arrivals and domain shifts in a single framework. Unlike traditional scenarios, it reflects the realistic nature of CL in domains such as medical imaging, where updates may introduce both new classes and changes in domain characteristics. To address the unique challenges posed by this complex scenario, we introduce a novel approach called Pseudo-Label Replay. This method aims to mitigate forgetting while adapting to new classes and domain shifts by combining the advantages of the Replay and Pseudo-Label methods and solving their limitations in the proposed scenario. We evaluate our proposed approach on a challenging benchmark consisting of two datasets, seven tasks, and nineteen classes, modeling a realistic Continual Learning scenario. Our experimental findings demonstrate the effectiveness of Pseudo-Label Replay in addressing the challenges posed by the complex scenario proposed. Our method surpasses existing approaches, exhibiting superior performance while showing minimal forgetting.
- [1056] arXiv:2404.06869 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: SleepPPG-Net2: Deep learning generalization for sleep staging from photoplethysmographyShirel Attia , Revital Shani Hershkovich , Alissa Tabakhov , Angeleene Ang , Sharon Haimov , Riva Tauman , Joachim A. BeharSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: Background: Sleep staging is a fundamental component in the diagnosis of sleep disorders and the management of sleep health. Traditionally, this analysis is conducted in clinical settings and involves a time-consuming scoring procedure. Recent data-driven algorithms for sleep staging, using the photoplethysmogram (PPG) time series, have shown high performance on local test sets but lower performance on external datasets due to data drift. Methods: This study aimed to develop a generalizable deep learning model for the task of four class (wake, light, deep, and rapid eye movement (REM)) sleep staging from raw PPG physiological time-series. Six sleep datasets, totaling 2,574 patients recordings, were used. In order to create a more generalizable representation, we developed and evaluated a deep learning model called SleepPPG-Net2, which employs a multi-source domain training approach.SleepPPG-Net2 was benchmarked against two state-of-the-art models. Results: SleepPPG-Net2 showed consistently higher performance over benchmark approaches, with generalization performance (Cohen's kappa) improving by up to 19%. Performance disparities were observed in relation to age, sex, and sleep apnea severity. Conclusion: SleepPPG-Net2 sets a new standard for staging sleep from raw PPG time-series.
- [1057] arXiv:2404.06883 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Research on Detection of Floating Objects in River and Lake Based on AI Intelligent Image RecognitionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: With the rapid advancement of artificial intelligence technology, AI-enabled image recognition has emerged as a potent tool for addressing challenges in traditional environmental monitoring. This study focuses on the detection of floating objects in river and lake environments, exploring an innovative approach based on deep learning. By intricately analyzing the technical pathways for detecting static and dynamic features and considering the characteristics of river and lake debris, a comprehensive image acquisition and processing workflow has been developed. The study highlights the application and performance comparison of three mainstream deep learning models -SSD, Faster-RCNN, and YOLOv5- in debris identification. Additionally, a detection system for floating objects has been designed and implemented, encompassing both hardware platform construction and software framework development. Through rigorous experimental validation, the proposed system has demonstrated its ability to significantly enhance the accuracy and efficiency of debris detection, thus offering a new technological avenue for water quality monitoring in rivers and lakes
- [1058] arXiv:2404.06903 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian SplattingShijie Zhou , Zhiwen Fan , Dejia Xu , Haoran Chang , Pradyumna Chari , Tejas Bharadwaj , Suya You , Zhangyang Wang , Achuta KadambiSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The increasing demand for virtual reality applications has highlighted the significance of crafting immersive 3D assets. We present a text-to-3D 360$^{\circ}$ scene generation pipeline that facilitates the creation of comprehensive 360$^{\circ}$ scenes for in-the-wild environments in a matter of minutes. Our approach utilizes the generative power of a 2D diffusion model and prompt self-refinement to create a high-quality and globally coherent panoramic image. This image acts as a preliminary "flat" (2D) scene representation. Subsequently, it is lifted into 3D Gaussians, employing splatting techniques to enable real-time exploration. To produce consistent 3D geometry, our pipeline constructs a spatially coherent structure by aligning the 2D monocular depth into a globally optimized point cloud. This point cloud serves as the initial state for the centroids of 3D Gaussians. In order to address invisible issues inherent in single-view inputs, we impose semantic and geometric constraints on both synthesized and input camera views as regularizations. These guide the optimization of Gaussians, aiding in the reconstruction of unseen regions. In summary, our method offers a globally consistent 3D scene within a 360$^{\circ}$ perspective, providing an enhanced immersive experience over existing techniques. Project website at: this http URL
- [1059] arXiv:2404.06910 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Superposition Prompting: Improving and Accelerating Retrieval-Augmented GenerationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Despite the successes of large language models (LLMs), they exhibit significant drawbacks, particularly when processing long contexts. Their inference cost scales quadratically with respect to sequence length, making it expensive for deployment in some real-world text processing applications, such as retrieval-augmented generation (RAG). Additionally, LLMs also exhibit the "distraction phenomenon," where irrelevant context in the prompt degrades output quality. To address these drawbacks, we propose a novel RAG prompting methodology, superposition prompting, which can be directly applied to pre-trained transformer-based LLMs without the need for fine-tuning. At a high level, superposition prompting allows the LLM to process input documents in parallel prompt paths, discarding paths once they are deemed irrelevant. We demonstrate the capability of our method to simultaneously enhance time efficiency across a variety of question-answering benchmarks using multiple pre-trained LLMs. Furthermore, our technique significantly improves accuracy when the retrieved context is large relative the context the model was trained on. For example, our approach facilitates an 93x reduction in compute time while improving accuracy by 43\% on the NaturalQuestions-Open dataset with the MPT-7B instruction-tuned model over naive RAG.
- [1060] arXiv:2404.06921 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM ApplicationsShishir G. Patil , Tianjun Zhang , Vivian Fang , Noppapon C. , Roy Huang , Aaron Hao , Martin Casado , Joseph E. Gonzalez , Raluca Ada Popa , Ion StoicaSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) are evolving beyond their classical role of providing information within dialogue systems to actively engaging with tools and performing actions on real-world applications and services. Today, humans verify the correctness and appropriateness of the LLM-generated outputs (e.g., code, functions, or actions) before putting them into real-world execution. This poses significant challenges as code comprehension is well known to be notoriously difficult. In this paper, we study how humans can efficiently collaborate with, delegate to, and supervise autonomous LLMs in the future. We argue that in many cases, "post-facto validation" - verifying the correctness of a proposed action after seeing the output - is much easier than the aforementioned "pre-facto validation" setting. The core concept behind enabling a post-facto validation system is the integration of an intuitive undo feature, and establishing a damage confinement for the LLM-generated actions as effective strategies to mitigate the associated risks. Using this, a human can now either revert the effect of an LLM-generated output or be confident that the potential risk is bounded. We believe this is critical to unlock the potential for LLM agents to interact with applications and services with limited (post-facto) human involvement. We describe the design and implementation of our open-source runtime for executing LLM actions, Gorilla Execution Engine (GoEX), and present open research questions towards realizing the goal of LLMs and applications interacting with each other with minimal human supervision. We release GoEX at this https URL .
- [1061] arXiv:2404.06939 (cross-list from cs.ET) [ pdf , ps , html , other ]
-
Title: Fast System Technology Co-Optimization Framework for Emerging Technology Based on Graph Neural NetworksComments: Accepted by the 61th Design Automation Conference (DAC)Subjects: Emerging Technologies (cs.ET) ; Artificial Intelligence (cs.AI)
Abstract: This paper proposes a fast system technology co-optimization (STCO) framework that optimizes power, performance, and area (PPA) for next-generation IC design, addressing the challenges and opportunities presented by novel materials and device architectures. We focus on accelerating the technology level of STCO using AI techniques, by employing graph neural network (GNN)-based approaches for both TCAD simulation and cell library characterization, which are interconnected through a unified compact model, collectively achieving over a 100X speedup over traditional methods. These advancements enable comprehensive STCO iterations with runtime speedups ranging from 1.9X to 14.1X and supports both emerging and traditional technologies.
- [1062] arXiv:2404.06948 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MetaCheckGPT -- A Multi-task Hallucination Detector Using LLM Uncertainty and Meta-modelsComments: Entry for SemEval-2024 Shared Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration MistakesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Hallucinations in large language models (LLMs) have recently become a significant problem. A recent effort in this direction is a shared task at Semeval 2024 Task 6, SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. This paper describes our winning solution ranked 1st and 2nd in the 2 sub-tasks of model agnostic and model aware tracks respectively. We propose a meta-regressor framework of LLMs for model evaluation and integration that achieves the highest scores on the leaderboard. We also experiment with various transformer-based models and black box methods like ChatGPT, Vectara, and others. In addition, we perform an error analysis comparing GPT4 against our best model which shows the limitations of the former.
- [1063] arXiv:2404.06955 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Untangling Critical Interaction with AI in Students Written AssessmentSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Artificial Intelligence (AI) has become a ubiquitous part of society, but a key challenge exists in ensuring that humans are equipped with the required critical thinking and AI literacy skills to interact with machines effectively by understanding their capabilities and limitations. These skills are particularly important for learners to develop in the age of generative AI where AI tools can demonstrate complex knowledge and ability previously thought to be uniquely human. To activate effective human-AI partnerships in writing, this paper provides a first step toward conceptualizing the notion of critical learner interaction with AI. Using both theoretical models and empirical data, our preliminary findings suggest a general lack of Deep interaction with AI during the writing process. We believe that the outcomes can lead to better task and tool design in the future for learners to develop deep, critical thinking when interacting with AI.
- [1064] arXiv:2404.06957 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Adversarial purification for no-reference image-quality metrics: applicability study and new methodsAleksandr Gushchin , Anna Chistyakova , Vladislav Minashkin , Anastasia Antsiferova , Dmitriy VatolinSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recently, the area of adversarial attacks on image quality metrics has begun to be explored, whereas the area of defences remains under-researched. In this study, we aim to cover that case and check the transferability of adversarial purification defences from image classifiers to IQA methods. In this paper, we apply several widespread attacks on IQA models and examine the success of the defences against them. The purification methodologies covered different preprocessing techniques, including geometrical transformations, compression, denoising, and modern neural network-based methods. Also, we address the challenge of assessing the efficacy of a defensive methodology by proposing ways to estimate output visual quality and the success of neutralizing attacks. Defences were tested against attack on three IQA metrics -- Linearity, MetaIQA and SPAQ. The code for attacks and defences is available at: (link is hidden for a blind review).
- [1065] arXiv:2404.06962 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Advancing Real-time Pandemic Forecasting Using Large Language Models: A COVID-19 Case StudyHongru Du , Jianan Zhao , Yang Zhao , Shaochong Xu , Xihong Lin , Yiran Chen , Lauren M. Gardner , Hao Frank YangComments: 35 pages, 10 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Forecasting the short-term spread of an ongoing disease outbreak is a formidable challenge due to the complexity of contributing factors, some of which can be characterized through interlinked, multi-modality variables such as epidemiological time series data, viral biology, population demographics, and the intersection of public policy and human behavior. Existing forecasting model frameworks struggle with the multifaceted nature of relevant data and robust results translation, which hinders their performances and the provision of actionable insights for public health decision-makers. Our work introduces PandemicLLM, a novel framework with multi-modal Large Language Models (LLMs) that reformulates real-time forecasting of disease spread as a text reasoning problem, with the ability to incorporate real-time, complex, non-numerical information that previously unattainable in traditional forecasting models. This approach, through a unique AI-human cooperative prompt design and time series representation learning, encodes multi-modal data for LLMs. The model is applied to the COVID-19 pandemic, and trained to utilize textual public health policies, genomic surveillance, spatial, and epidemiological time series data, and is subsequently tested across all 50 states of the U.S. Empirically, PandemicLLM is shown to be a high-performing pandemic forecasting framework that effectively captures the impact of emerging variants and can provide timely and accurate predictions. The proposed PandemicLLM opens avenues for incorporating various pandemic-related data in heterogeneous formats and exhibits performance benefits over existing models. This study illuminates the potential of adapting LLMs and representation learning to enhance pandemic forecasting, illustrating how AI innovations can strengthen pandemic responses and crisis management in the future.
- [1066] arXiv:2404.06971 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: TrajPRed: Trajectory Prediction with Region-based Relation LearningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Forecasting human trajectories in traffic scenes is critical for safety within mixed or fully autonomous systems. Human future trajectories are driven by two major stimuli, social interactions, and stochastic goals. Thus, reliable forecasting needs to capture these two stimuli. Edge-based relation modeling represents social interactions using pairwise correlations from precise individual states. Nevertheless, edge-based relations can be vulnerable under perturbations. To alleviate these issues, we propose a region-based relation learning paradigm that models social interactions via region-wise dynamics of joint states, i.e., the changes in the density of crowds. In particular, region-wise agent joint information is encoded within convolutional feature grids. Social relations are modeled by relating the temporal changes of local joint information from a global perspective. We show that region-based relations are less susceptible to perturbations. In order to account for the stochastic individual goals, we exploit a conditional variational autoencoder to realize multi-goal estimation and diverse future prediction. Specifically, we perform variational inference via the latent distribution, which is conditioned on the correlation between input states and associated target goals. Sampling from the latent distribution enables the framework to reliably capture the stochastic behavior in test data. We integrate multi-goal estimation and region-based relation learning to model the two stimuli, social interactions, and stochastic goals, in a prediction framework. We evaluate our framework on the ETH-UCY dataset and Stanford Drone Dataset (SDD). We show that the diverse prediction better fits the ground truth when incorporating the relation module. Our framework outperforms the state-of-the-art models on SDD by $27.61\%$/$18.20\%$ of ADE/FDE metrics.
- [1067] arXiv:2404.06972 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Toward industrial use of continual learning : new metrics proposal for class incremental learningComments: 7 pages, Accepted at IJCNN 2023Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we investigate continual learning performance metrics used in class incremental learning strategies for continual learning (CL) using some high performing methods. We investigate especially mean task accuracy. First, we show that it lacks of expressiveness through some simple experiments to capture performance. We show that monitoring average tasks performance is over optimistic and can lead to misleading conclusions for future real life industrial uses. Then, we propose first a simple metric, Minimal Incremental Class Accuracy (MICA) which gives a fair and more useful evaluation of different continual learning methods. Moreover, in order to provide a simple way to easily compare different methods performance in continual learning, we derive another single scalar metric that take into account the learning performance variation as well as our newly introduced metric.
- [1068] arXiv:2404.06996 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: XNLIeu: a dataset for cross-lingual NLI in BasqueComments: Accepted to NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: XNLI is a popular Natural Language Inference (NLI) benchmark widely used to evaluate cross-lingual Natural Language Understanding (NLU) capabilities across languages. In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches. The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step. We have conducted a series of experiments using mono- and multilingual LLMs to assess a) the effect of professional post-edition on the MT system; b) the best cross-lingual strategy for NLI in Basque; and c) whether the choice of the best cross-lingual strategy is influenced by the fact that the dataset is built by translation. The results show that post-edition is necessary and that the translate-train cross-lingual strategy obtains better results overall, although the gain is lower when tested in a dataset that has been built natively from scratch. Our code and datasets are publicly available under open licenses.
- [1069] arXiv:2404.07001 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Event Grounded Criminal Court View Generation with Cooperative (Large) Language ModelsComments: Accepted to SIGIR2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: With the development of legal intelligence, Criminal Court View Generation has attracted much attention as a crucial task of legal intelligence, which aims to generate concise and coherent texts that summarize case facts and provide explanations for verdicts. Existing researches explore the key information in case facts to yield the court views. Most of them employ a coarse-grained approach that partitions the facts into broad segments (e.g., verdict-related sentences) to make predictions. However, this approach fails to capture the complex details present in the case facts, such as various criminal elements and legal events. To this end, in this paper, we propose an Event Grounded Generation (EGG) method for criminal court view generation with cooperative (Large) Language Models, which introduces the fine-grained event information into the generation. Specifically, we first design a LLMs-based extraction method that can extract events in case facts without massive annotated events. Then, we incorporate the extracted events into court view generation by merging case facts and events. Besides, considering the computational burden posed by the use of LLMs in the extraction phase of EGG, we propose a LLMs-free EGG method that can eliminate the requirement for event extraction using LLMs in the inference phase. Extensive experimental results on a real-world dataset clearly validate the effectiveness of our proposed method.
- [1070] arXiv:2404.07005 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: WordDecipher: Enhancing Digital Workspace Communication with Explainable AI for Non-native English SpeakersComments: The Third Workshop on Intelligent and Interactive Writing Assistants (In2Writing) at CHI 2024Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Non-native English speakers (NNES) face challenges in digital workspace communication (e.g., emails, Slack messages), often inadvertently translating expressions from their native languages, which can lead to awkward or incorrect usage. Current AI-assisted writing tools are equipped with fluency enhancement and rewriting suggestions; however, NNES may struggle to grasp the subtleties among various expressions, making it challenging to choose the one that accurately reflects their intent. Such challenges are exacerbated in high-stake text-based communications, where the absence of non-verbal cues heightens the risk of misinterpretation. By leveraging the latest advancements in large language models (LLM) and word embeddings, we propose WordDecipher, an explainable AI-assisted writing tool to enhance digital workspace communication for NNES. WordDecipher not only identifies the perceived social intentions detected in users' writing, but also generates rewriting suggestions aligned with users' intended messages, either numerically or by inferring from users' writing in their native language. Then, WordDecipher provides an overview of nuances to help NNES make selections. Through a usage scenario, we demonstrate how WordDecipher can significantly enhance an NNES's ability to communicate her request, showcasing its potential to transform workspace communication for NNES.
- [1071] arXiv:2404.07008 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Knowledge graphs for empirical concept retrievalLenka Tětková , Teresa Karen Scheidt , Maria Mandrup Fogh , Ellen Marie Gaunby Jørgensen , Finn Årup Nielsen , Lars Kai HansenComments: Preprint. Accepted to The 2nd World Conference on eXplainable Artificial IntelligenceSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Concept-based explainable AI is promising as a tool to improve the understanding of complex models at the premises of a given user, viz.\ as a tool for personalized explainability. An important class of concept-based explainability methods is constructed with empirically defined concepts, indirectly defined through a set of positive and negative examples, as in the TCAV approach (Kim et al., 2018). While it is appealing to the user to avoid formal definitions of concepts and their operationalization, it can be challenging to establish relevant concept datasets. Here, we address this challenge using general knowledge graphs (such as, e.g., Wikidata or WordNet) for comprehensive concept definition and present a workflow for user-driven data collection in both text and image domains. The concepts derived from knowledge graphs are defined interactively, providing an opportunity for personalization and ensuring that the concepts reflect the user's intentions. We test the retrieved concept datasets on two concept-based explainability methods, namely concept activation vectors (CAVs) and concept activation regions (CARs) (Crabbe and van der Schaar, 2022). We show that CAVs and CARs based on these empirical concept datasets provide robust and accurate explanations. Importantly, we also find good alignment between the models' representations of concepts and the structure of knowledge graphs, i.e., human representations. This supports our conclusion that knowledge graph-based concepts are relevant for XAI.
- [1072] arXiv:2404.07017 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Improving Language Model Reasoning with Self-motivated LearningComments: Accepted at LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large-scale high-quality training data is important for improving the performance of models. After trained with data that has rationales (reasoning steps), models gain reasoning capability. However, the dataset with high-quality rationales is relatively scarce due to the high annotation cost. To address this issue, we propose \textit{Self-motivated Learning} framework. The framework motivates the model itself to automatically generate rationales on existing datasets. Based on the inherent rank from correctness across multiple rationales, the model learns to generate better rationales, leading to higher reasoning capability. Specifically, we train a reward model with the rank to evaluate the quality of rationales, and improve the performance of reasoning through reinforcement learning. Experiment results of Llama2 7B on multiple reasoning datasets show that our method significantly improves the reasoning ability of models, even outperforming text-davinci-002 in some datasets.
- [1073] arXiv:2404.07046 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Comparison of decision trees with Local Interpretable Model-Agnostic Explanations (LIME) technique and multi-linear regression for explaining support vector regression model in terms of root mean square error (RMSE) valuesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In this work the decision trees are used for explanation of support vector regression model. The decision trees act as a global technique as well as a local technique. They are compared against the popular technique of LIME which is a local explanatory technique and with multi linear regression. It is observed that decision trees give a lower RMSE value when fitted to support vector regression as compared to LIME in 87% of the runs over 5 datasets. The comparison of results is statistically significant. Multi linear regression also gives a lower RMSE value when fitted to support vector regression model as compared to LIME in 73% of the runs over 5 datasets but the comparison of results is not statistically significant. Also, when used as a local explanatory technique, decision trees give better performance than LIME and the comparison of results is statistically significant.
- [1074] arXiv:2404.07053 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and InterpretationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Metaphors, although occasionally unperceived, are ubiquitous in our everyday language. Thus, it is crucial for Language Models to be able to grasp the underlying meaning of this kind of figurative language. In this work, we present Meta4XNLI, a novel parallel dataset for the tasks of metaphor detection and interpretation that contains metaphor annotations in both Spanish and English. We investigate language models' metaphor identification and understanding abilities through a series of monolingual and cross-lingual experiments by leveraging our proposed corpus. In order to comprehend how these non-literal expressions affect models' performance, we look over the results and perform an error analysis. Additionally, parallel data offers many potential opportunities to investigate metaphor transferability between these languages and the impact of translation on the development of multilingual annotated resources.
- [1075] arXiv:2404.07063 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: LaPlaSS: Latent Space Planning for Stochastic SystemsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Autonomous mobile agents often operate in hazardous environments, necessitating an awareness of safety. These agents can have non-linear, stochastic dynamics that must be considered during planning to guarantee bounded risk. Most state of the art methods require closed-form dynamics to verify plan correctness and safety however modern robotic systems often have dynamics that are learned from data. Thus, there is a need to perform efficient trajectory planning with guarantees on risk for agents without known dynamics models. We propose a "generate-and-test" approach to risk-bounded planning in which a planner generates a candidate trajectory using an approximate linear dynamics model and a validator assesses the risk of the trajectory, computing additional safety constraints for the planner if the candidate does not satisfy the desired risk bound. To acquire the approximate model, we use a variational autoencoder to learn a latent linear dynamics model and encode the planning problem into the latent space to generate the candidate trajectory. The VAE also serves to sample trajectories around the candidate to use in the validator. We demonstrate that our algorithm, LaPlaSS, is able to generate trajectory plans with bounded risk for a real-world agent with learned dynamics and is an order of magnitude more efficient than the state of the art.
- [1076] arXiv:2404.07066 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?Mingyu Jin , Qinkai Yu , Jingyuan Huang , Qingcheng Zeng , Zhenting Wang , Wenyue Hua , Haiyan Zhao , Kai Mei , Yanda Meng , Kaize Ding , Fan Yang , Mengnan Du , Yongfeng ZhangComments: 12 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) have shown remarkable performances across a wide range of tasks. However, the mechanisms by which these models encode tasks of varying complexities remain poorly understood. In this paper, we explore the hypothesis that LLMs process concepts of varying complexities in different layers, introducing the idea of "Concept Depth" to suggest that more complex concepts are typically acquired in deeper layers. Specifically, we categorize concepts based on their level of abstraction, defining them in the order of increasing complexity within factual, emotional, and inferential tasks. We conduct extensive probing experiments using layer-wise representations across various LLM families (Gemma, LLaMA, QWen) on various datasets spanning the three domains of tasks. Our findings reveal that models could efficiently conduct probing for simpler tasks in shallow layers, and more complex tasks typically necessitate deeper layers for accurate understanding. Additionally, we examine how external factors, such as adding noise to the input and quantizing the model weights, might affect layer-wise representations. Our findings suggest that these factors can impede the development of a conceptual understanding of LLMs until deeper layers are explored. We hope that our proposed concept and experimental insights will enhance the understanding of the mechanisms underlying LLMs. Our codes are available at this https URL .
- [1077] arXiv:2404.07084 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Dynamic Generation of Personalities with Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In the realm of mimicking human deliberation, large language models (LLMs) show promising performance, thereby amplifying the importance of this research area. Deliberation is influenced by both logic and personality. However, previous studies predominantly focused on the logic of LLMs, neglecting the exploration of personality aspects. In this work, we introduce Dynamic Personality Generation (DPG), a dynamic personality generation method based on Hypernetworks. Initially, we embed the Big Five personality theory into GPT-4 to form a personality assessment machine, enabling it to evaluate characters' personality traits from dialogues automatically. We propose a new metric to assess personality generation capability based on this evaluation method. Then, we use this personality assessment machine to evaluate dialogues in script data, resulting in a personality-dialogue dataset. Finally, we fine-tune DPG on the personality-dialogue dataset. Experiments prove that DPG's personality generation capability is stronger after fine-tuning on this dataset than traditional fine-tuning methods, surpassing prompt-based GPT-4.
- [1078] arXiv:2404.07091 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: LaTiM: Longitudinal representation learning in continuous-time models to predict disease progressionRachid Zeghlache , Pierre-Henri Conze , Mostafa El Habib Daho , Yihao Li , Hugo Le Boité , Ramin Tadayoni , Pascal Massin , Béatrice Cochener , Alireza Rezaei , Ikram Brahim , Gwenolé Quellec , Mathieu LamardComments: Submitted to MICCAI 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This work proposes a novel framework for analyzing disease progression using time-aware neural ordinary differential equations (NODE). We introduce a "time-aware head" in a framework trained through self-supervised learning (SSL) to leverage temporal information in latent space for data augmentation. This approach effectively integrates NODEs with SSL, offering significant performance improvements compared to traditional methods that lack explicit temporal integration. We demonstrate the effectiveness of our strategy for diabetic retinopathy progression prediction using the OPHDIAT database. Compared to the baseline, all NODE architectures achieve statistically significant improvements in area under the ROC curve (AUC) and Kappa metrics, highlighting the efficacy of pre-training with SSL-inspired approaches. Additionally, our framework promotes stable training for NODEs, a commonly encountered challenge in time-aware modeling.
- [1079] arXiv:2404.07096 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: TransTARec: Time-Adaptive Translating Embedding Model for Next POI RecommendationComments: This paper has been accepted by the 2024 5th International Conference on Computer Engineering and Application (ICCEA 2024)Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: The rapid growth of location acquisition technologies makes Point-of-Interest(POI) recommendation possible due to redundant user check-in records. In this paper, we focus on next POI recommendation in which next POI is based on previous POI. We observe that time plays an important role in next POI recommendation but is neglected in the recent proposed translating embedding methods. To tackle this shortage, we propose a time-adaptive translating embedding model (TransTARec) for next POI recommendation that naturally incorporates temporal influence, sequential dynamics, and user preference within a single component. Methodologically, we treat a (previous timestamp, user, next timestamp) triplet as a union translation vector and develop a neural-based fusion operation to fuse user preference and temporal influence. The superiority of TransTARec, which is confirmed by extensive experiments on real-world datasets, comes from not only the introduction of temporal influence but also the direct unification with user preference and sequential dynamics.
- [1080] arXiv:2404.07099 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Rethinking Out-of-Distribution Detection for Reinforcement Learning: Advancing Methods for Evaluation and DetectionComments: Accepted as a full paper to the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: While reinforcement learning (RL) algorithms have been successfully applied across numerous sequential decision-making problems, their generalization to unforeseen testing environments remains a significant concern. In this paper, we study the problem of out-of-distribution (OOD) detection in RL, which focuses on identifying situations at test time that RL agents have not encountered in their training environments. We first propose a clarification of terminology for OOD detection in RL, which aligns it with the literature from other machine learning domains. We then present new benchmark scenarios for OOD detection, which introduce anomalies with temporal autocorrelation into different components of the agent-environment loop. We argue that such scenarios have been understudied in the current literature, despite their relevance to real-world situations. Confirming our theoretical predictions, our experimental results suggest that state-of-the-art OOD detectors are not able to identify such anomalies. To address this problem, we propose a novel method for OOD detection, which we call DEXTER (Detection via Extraction of Time Series Representations). By treating environment observations as time series data, DEXTER extracts salient time series features, and then leverages an ensemble of isolation forest algorithms to detect anomalies. We find that DEXTER can reliably identify anomalies across benchmark scenarios, exhibiting superior performance compared to both state-of-the-art OOD detectors and high-dimensional changepoint detectors adopted from statistics.
- [1081] arXiv:2404.07123 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Semantically-correlated memories in a dense associative modelComments: 35 pages, 32 figuresSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Abstract: I introduce a novel associative memory model named Correlated Dense Associative Memory (CDAM), which integrates both auto- and hetero-association in a unified framework for continuous-valued memory patterns. Employing an arbitrary graph structure to semantically link memory patterns, CDAM is theoretically and numerically analysed, revealing four distinct dynamical modes: auto-association, narrow hetero-association, wide hetero-association, and neutral quiescence. Drawing inspiration from inhibitory modulation studies, I employ anti-Hebbian learning rules to control the range of hetero-association, extract multi-scale representations of community structures in graphs, and stabilise the recall of temporal sequences. Experimental demonstrations showcase CDAM's efficacy in handling real-world data, replicating a classical neuroscience experiment, performing image retrieval, and simulating arbitrary finite automata.
- [1082] arXiv:2404.07124 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Measuring proximity to standard planes during fetal brain ultrasound scanningChiara Di Vece , Antonio Cirigliano , Meala Le Lous , Raffaele Napolitano , Anna L. David , Donald Peebles , Pierre Jannin , Francisco Vasconcelos , Danail StoyanovComments: 11 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces a novel pipeline designed to bring ultrasound (US) plane pose estimation closer to clinical use for more effective navigation to the standard planes (SPs) in the fetal brain. We propose a semi-supervised segmentation model utilizing both labeled SPs and unlabeled 3D US volume slices. Our model enables reliable segmentation across a diverse set of fetal brain images. Furthermore, the model incorporates a classification mechanism to identify the fetal brain precisely. Our model not only filters out frames lacking the brain but also generates masks for those containing it, enhancing the relevance of plane pose regression in clinical settings. We focus on fetal brain navigation from 2D ultrasound (US) video analysis and combine this model with a US plane pose regression network to provide sensorless proximity detection to SPs and non-SPs planes; we emphasize the importance of proximity detection to SPs for guiding sonographers, offering a substantial advantage over traditional methods by allowing earlier and more precise adjustments during scanning. We demonstrate the practical applicability of our approach through validation on real fetal scan videos obtained from sonographers of varying expertise levels. Our findings demonstrate the potential of our approach to complement existing fetal US technologies and advance prenatal diagnostic practices.
- [1083] arXiv:2404.07135 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Robustness of Text-to-Visualization Translation against Lexical and Phrasal VariabilitySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Text-to-Vis is an emerging task in the natural language processing (NLP) area that aims to automatically generate data visualizations from natural language questions (NLQs). Despite their progress, existing text-to-vis models often heavily rely on lexical matching between words in the questions and tokens in data schemas. This overreliance on lexical matching may lead to a diminished level of model robustness against input variations. In this study, we thoroughly examine the robustness of current text-to-vis models, an area that has not previously been explored. In particular, we construct the first robustness dataset nvBench-Rob, which contains diverse lexical and phrasal variations based on the original text-to-vis benchmark nvBench. Then, we found that the performance of existing text-to-vis models on this new dataset dramatically drops, implying that these methods exhibit inadequate robustness overall. Finally, we propose a novel framework based on Retrieval-Augmented Generation (RAG) technique, named GRED, specifically designed to address input perturbations in these two variants. The framework consists of three parts: NLQ-Retrieval Generator, Visualization Query-Retrieval Retuner and Annotation-based Debugger, which are used to tackle the challenges posed by natural language variants, programming style differences and data schema variants, respectively. Extensive experimental evaluations show that, compared to the state-of-the-art model RGVisNet in the Text-to-Vis field, GRED performs better in terms of model robustness, with a 32% increase in accuracy on the proposed nvBench-Rob dataset.
- [1084] arXiv:2404.07143 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attentionComments: 9 pages, 4 figures, 4 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Abstract: This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.
- [1085] arXiv:2404.07164 (cross-list from cs.AR) [ pdf , ps , html , other ]
-
Title: Analysis of Distributed Optimization Algorithms on a Real Processing-In-Memory SystemSteve Rhyner , Haocong Luo , Juan Gómez-Luna , Mohammad Sadrosadati , Jiawei Jiang , Ataberk Olgun , Harshita Gupta , Ce Zhang , Onur MutluSubjects: Hardware Architecture (cs.AR) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Abstract: Machine Learning (ML) training on large-scale datasets is a very expensive and time-consuming workload. Processor-centric architectures (e.g., CPU, GPU) commonly used for modern ML training workloads are limited by the data movement bottleneck, i.e., due to repeatedly accessing the training dataset. As a result, processor-centric systems suffer from performance degradation and high energy consumption. Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck by placing the computation mechanisms inside or near memory.
Our goal is to understand the capabilities and characteristics of popular distributed optimization algorithms on real-world PIM architectures to accelerate data-intensive ML training workloads. To this end, we 1) implement several representative centralized distributed optimization algorithms on UPMEM's real-world general-purpose PIM system, 2) rigorously evaluate these algorithms for ML training on large-scale datasets in terms of performance, accuracy, and scalability, 3) compare to conventional CPU and GPU baselines, and 4) discuss implications for future PIM hardware and the need to shift to an algorithm-hardware codesign perspective to accommodate decentralized distributed optimization algorithms.
Our results demonstrate three major findings: 1) Modern general-purpose PIM architectures can be a viable alternative to state-of-the-art CPUs and GPUs for many memory-bound ML training workloads, when operations and datatypes are natively supported by PIM hardware, 2) the importance of carefully choosing the optimization algorithm that best fit PIM, and 3) contrary to popular belief, contemporary PIM architectures do not scale approximately linearly with the number of nodes for many data-intensive ML training workloads. To facilitate future research, we aim to open-source our complete codebase. - [1086] arXiv:2404.07168 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Using Neural Networks to Model Hysteretic Kinematics in Tendon-Actuated Continuum RobotsYuan Wang , Max McCandless , Abdulhamit Donder , Giovanni Pittiglio , Behnam Moradkhani , Yash Chitalia , Pierre E. DupontComments: 7 pages, 8 figures, conferenceSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: The ability to accurately model mechanical hysteretic behavior in tendon-actuated continuum robots using deep learning approaches is a growing area of interest. In this paper, we investigate the hysteretic response of two types of tendon-actuated continuum robots and, ultimately, compare three types of neural network modeling approaches with both forward and inverse kinematic mappings: feedforward neural network (FNN), FNN with a history input buffer, and long short-term memory (LSTM) network. We seek to determine which model best captures temporal dependent behavior. We find that, depending on the robot's design, choosing different kinematic inputs can alter whether hysteresis is exhibited by the system. Furthermore, we present the results of the model fittings, revealing that, in contrast to the standard FNN, both FNN with a history input buffer and the LSTM model exhibit the capacity to model historical dependence with comparable performance in capturing rate-dependent hysteresis.
- [1087] arXiv:2404.07170 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Worst-Case Convergence Time of ML Algorithms via Extreme Value TheoryComments: In 3rd International Conference on AI Engineering: Software Engineering for AI (CAIN 2024)Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF); Programming Languages (cs.PL)
Abstract: This paper leverages the statistics of extreme values to predict the worst-case convergence times of machine learning algorithms. Timing is a critical non-functional property of ML systems, and providing the worst-case converge times is essential to guarantee the availability of ML and its services. However, timing properties such as worst-case convergence times (WCCT) are difficult to verify since (1) they are not encoded in the syntax or semantics of underlying programming languages of AI, (2) their evaluations depend on both algorithmic implementations and underlying systems, and (3) their measurements involve uncertainty and noise. Therefore, prevalent formal methods and statistical models fail to provide rich information on the amounts and likelihood of WCCT.
Our key observation is that the timing information we seek represents the extreme tail of execution times. Therefore, extreme value theory (EVT), a statistical discipline that focuses on understanding and predicting the distribution of extreme values in the tail of outcomes, provides an ideal framework to model and analyze WCCT in the training and inference phases of ML paradigm. Building upon the mathematical tools from EVT, we propose a practical framework to predict the worst-case timing properties of ML. Over a set of linear ML training algorithms, we show that EVT achieves a better accuracy for predicting WCCTs than relevant statistical methods such as the Bayesian factor. On the set of larger machine learning training algorithms and deep neural network inference, we show the feasibility and usefulness of EVT models to accurately predict WCCTs, their expected return periods, and their likelihood. - [1088] arXiv:2404.07185 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Reward Learning from Suboptimal Demonstrations with Applications in Surgical ElectrocauteryComments: In proceedings of the International Symposium on Medical Robotics (ISMR) 2024. Equal contribution from two first authorsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Automating robotic surgery via learning from demonstration (LfD) techniques is extremely challenging. This is because surgical tasks often involve sequential decision-making processes with complex interactions of physical objects and have low tolerance for mistakes. Prior works assume that all demonstrations are fully observable and optimal, which might not be practical in the real world. This paper introduces a sample-efficient method that learns a robust reward function from a limited amount of ranked suboptimal demonstrations consisting of partial-view point cloud observations. The method then learns a policy by optimizing the learned reward function using reinforcement learning (RL). We show that using a learned reward function to obtain a policy is more robust than pure imitation learning. We apply our approach on a physical surgical electrocautery task and demonstrate that our method can perform well even when the provided demonstrations are suboptimal and the observations are high-dimensional point clouds. Code and videos available here: this https URL
- [1089] arXiv:2404.07194 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: VN-EGNN: E(3)-Equivariant Graph Neural Networks with Virtual Nodes Enhance Protein Binding Site IdentificationFlorian Sestak , Lisa Schneckenreiter , Johannes Brandstetter , Sepp Hochreiter , Andreas Mayr , Günter KlambauerSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Abstract: Being able to identify regions within or around proteins, to which ligands can potentially bind, is an essential step to develop new drugs. Binding site identification methods can now profit from the availability of large amounts of 3D structures in protein structure databases or from AlphaFold predictions. Current binding site identification methods heavily rely on graph neural networks (GNNs), usually designed to output E(3)-equivariant predictions. Such methods turned out to be very beneficial for physics-related tasks like binding energy or motion trajectory prediction. However, the performance of GNNs at binding site identification is still limited potentially due to the lack of dedicated nodes that model hidden geometric entities, such as binding pockets. In this work, we extend E(n)-Equivariant Graph Neural Networks (EGNNs) by adding virtual nodes and applying an extended message passing scheme. The virtual nodes in these graphs are dedicated quantities to learn representations of binding sites, which leads to improved predictive performance. In our experiments, we show that our proposed method VN-EGNN sets a new state-of-the-art at locating binding site centers on COACH420, HOLO4K and PDBbind2020.
- [1090] arXiv:2404.07199 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth DiffusionComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Abstract: We introduce RealmDreamer, a technique for generation of general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing the state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model by conditioning on the samples from the inpainting model, giving rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.
- [1091] arXiv:2404.07202 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: UMBRAE: Unified Multimodal Decoding of Brain SignalsComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: We address prevailing challenges of the brain-powered research, departing from the observation that the literature hardly recover accurate spatial information and require subject-specific models. To address these challenges, we propose UMBRAE, a unified multimodal decoding of brain signals. First, to extract instance-level conceptual and spatial details from neural signals, we introduce an efficient universal brain encoder for multimodal-brain alignment and recover object descriptions at multiple levels of granularity from subsequent multimodal large language model (MLLM). Second, we introduce a cross-subject training strategy mapping subject-specific features to a common feature space. This allows a model to be trained on multiple subjects without extra resources, even yielding superior results compared to subject-specific models. Further, we demonstrate this supports weakly-supervised adaptation to new subjects, with only a fraction of the total training data. Experiments demonstrate that UMBRAE not only achieves superior results in the newly introduced tasks but also outperforms methods in well established tasks. To assess our method, we construct and share with the community a comprehensive brain understanding benchmark BrainHub. Our code and benchmark are available at this https URL .
- [1092] arXiv:2404.07204 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: BRAVE: Broadening the visual encoding of vision-language modelsOğuzhan Fatih Kar , Alessio Tonioni , Petra Poklukar , Achin Kulshrestha , Amir Zamir , Federico TombariComments: Project page at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring a smaller number of trainable parameters than existing methods and having a more compressed representation. Our results highlight the potential of incorporating different visual biases for a more broad and contextualized visual understanding of VLMs.
- [1093] arXiv:2404.07206 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: GoodDrag: Towards Good Practices for Drag Editing with Diffusion ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Abstract: In this paper, we introduce GoodDrag, a novel approach to improve the stability and image quality of drag editing. Unlike existing methods that struggle with accumulated perturbations and often result in distortions, GoodDrag introduces an AlDD framework that alternates between drag and denoising operations within the diffusion process, effectively improving the fidelity of the result. We also propose an information-preserving motion supervision operation that maintains the original features of the starting point for precise manipulation and artifact reduction. In addition, we contribute to the benchmarking of drag editing by introducing a new dataset, Drag100, and developing dedicated quality assessment metrics, Dragging Accuracy Index and Gemini Score, utilizing Large Multimodal Models. Extensive experiments demonstrate that the proposed GoodDrag compares favorably against the state-of-the-art approaches both qualitatively and quantitatively. The project page is this https URL .
- [1094] arXiv:2404.07208 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Uncertainty-guided annotation enhances segmentation with the human-in-the-loopSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Deep learning algorithms, often critiqued for their 'black box' nature, traditionally fall short in providing the necessary transparency for trusted clinical use. This challenge is particularly evident when such models are deployed in local hospitals, encountering out-of-domain distributions due to varying imaging techniques and patient-specific pathologies. Yet, this limitation offers a unique avenue for continual learning. The Uncertainty-Guided Annotation (UGA) framework introduces a human-in-the-loop approach, enabling AI to convey its uncertainties to clinicians, effectively acting as an automated quality control mechanism. UGA eases this interaction by quantifying uncertainty at the pixel level, thereby revealing the model's limitations and opening the door for clinician-guided corrections. We evaluated UGA on the Camelyon dataset for lymph node metastasis segmentation which revealed that UGA improved the Dice coefficient (DC), from 0.66 to 0.76 by adding 5 patches, and further to 0.84 with 10 patches. To foster broader application and community contribution, we have made our code accessible at
- [1095] arXiv:2404.07211 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A real-time Artificial Intelligence system for learning Sign LanguageSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: A primary challenge for the deaf and hearing-impaired community stems from the communication gap with the hearing society, which can greatly impact their daily lives and result in social exclusion. To foster inclusivity in society, our endeavor focuses on developing a cost-effective, resource-efficient, and open technology based on Artificial Intelligence, designed to assist people in learning and using Sign Language for communication. The analysis presented in this research paper intends to enrich the recent academic scientific literature on Sign Language solutions based on Artificial Intelligence, with a particular focus on American Sign Language (ASL). This research has yielded promising preliminary results and serves as a basis for further development.
- [1096] arXiv:2404.07212 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Hybrid Training of Denoising Networks to Improve the Texture Acutance of Digital CamerasJournal-ref: Scale Space and Variational Methods in Computer Vision, May 2023, Santa Margherita di Pula, Italy. pp.314-325Subjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: In order to evaluate the capacity of a camera to render textures properly, the standard practice, used by classical scoring protocols, is to compute the frequential response to a dead leaves image target, from which is built a texture acutance metric. In this work, we propose a mixed training procedure for image restoration neural networks, relying on both natural and synthetic images, that yields a strong improvement of this acutance metric without impairing fidelity terms. The feasibility of the approach is demonstrated both on the denoising of RGB images and the full development of RAW images, opening the path to a systematic improvement of the texture acutance of real imaging devices.
- [1097] arXiv:2404.07213 (cross-list from cs.NE) [ pdf , ps , other ]
-
Title: Evolving Genetic Programming Tree Models for Predicting the Mechanical Properties of Green Fibers for Better Biocomposite MaterialsSubjects: Neural and Evolutionary Computing (cs.NE) ; Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Abstract: Advanced modern technology and industrial sustainability theme have contributed implementing composite materials for various industrial applications. Green composites are among the desired alternatives for the green products. However, to properly control the performance of the green composites, predicting their constituents properties are of paramount importance. This work presents an innovative evolving genetic programming tree models for predicting the mechanical properties of natural fibers based upon several inherent chemical and physical properties. Cellulose, hemicellulose, lignin and moisture contents as well as the Microfibrillar angle of various natural fibers were considered to establish the prediction models. A one-hold-out methodology was applied for training/testing phases. Robust models were developed to predict the tensile strength, Young's modulus, and the elongation at break properties of the natural fibers. It was revealed that Microfibrillar angle was dominant and capable of determining the ultimate tensile strength of the natural fibers by 44.7% comparable to other considered properties, while the impact of cellulose content in the model was only 35.6%. This in order would facilitate utilizing artificial intelligence in predicting the overall mechanical properties of natural fibers without experimental efforts and cost to enhance developing better green composite materials for various industrial applications.
- [1098] arXiv:2404.07214 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future DirectionsComments: The most extensive and up to date Survey on Visual Language Models covering 76 Visual Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of the AI revolution. Nevertheless, these LLMs exhibit a notable limitation, as they are primarily adept at processing textual information. To address this constraint, researchers have endeavored to integrate visual capabilities with LLMs, resulting in the emergence of Vision-Language Models (VLMs). These advanced models are instrumental in tackling more intricate tasks such as image captioning and visual question answering. In our comprehensive survey paper, we delve into the key advancements within the realm of VLMs. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs and models that both accept and produce multimodal inputs and outputs.This classification is based on their respective capabilities and functionalities in processing and generating various modalities of data.We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible, providing readers with a comprehensive understanding of its essential components. We also analyzed the performance of VLMs in various benchmark datasets. By doing so, we aim to offer a nuanced understanding of the diverse landscape of VLMs. Additionally, we underscore potential avenues for future research in this dynamic domain, anticipating further breakthroughs and advancements.
- [1099] arXiv:2404.07215 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: Computation Offloading for Multi-server Multi-access Edge Vehicular Networks: A DDQN-based MethodSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: In this paper, we investigate a multi-user offloading problem in the overlapping domain of a multi-server mobile edge computing system. We divide the original problem into two stages: the offloading decision making stage and the request scheduling stage. To prevent the terminal from going out of service area during offloading, we consider the mobility parameter of the terminal according to the human behaviour model when making the offloading decision, and then introduce a server evaluation mechanism based on both the mobility parameter and the server load to select the optimal offloading server. In order to fully utilise the server resources, we design a double deep Q-network (DDQN)-based reward evaluation algorithm that considers the priority of tasks when scheduling offload requests. Finally, numerical simulations are conducted to verify that our proposed method outperforms traditional mathematical computation methods as well as the DQN algorithm.
- [1100] arXiv:2404.07216 (cross-list from eess.IV) [ pdf , ps , other ]
-
Title: A Bio-Medical Snake Optimizer System Driven by Logarithmic Surviving Global Search for Optimizing Feature Selection and its application for Disorder RecognitionSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: It is of paramount importance to enhance medical practices, given how important it is to protect human life. Medical therapy can be accelerated by automating patient prediction using machine learning techniques. To double the efficiency of classifiers, several preprocessing strategies must be adopted for their crucial duty in this field. Feature selection (FS) is one tool that has been used frequently to modify data and enhance classification outcomes by lowering the dimensionality of datasets. Excluded features are those that have a poor correlation coefficient with the label class, that is, they have no meaningful correlation with classification and do not indicate where the instance belongs. Along with the recurring features, which show a strong association with the remainder of the features. Contrarily, the model being produced during training is harmed, and the classifier is misled by their presence. This causes overfitting and increases algorithm complexity and processing time. These are used in exploration to allow solutions to be found more thoroughly and in relation to a chosen solution than at random. TLSO, PLSO, and LLSO stand for Tournament Logarithmic Snake Optimizer, Proportional Logarithmic Snake Optimizer, and Linear Order Logarithmic Snake Optimizer, respectively. A number of 22 reference medical datasets were used in experiments. The findings indicate that, among 86 % of the datasets, TLSO attained the best accuracy, and among 82 % of the datasets, the best feature reduction. In terms of the standard deviation, the TLSO also attained noteworthy reliability and stability. On the basis of running duration, it is, nonetheless, quite effective.
- [1101] arXiv:2404.07217 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Attention-aware Semantic Communications for Collaborative InferenceSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: We propose a communication-efficient collaborative inference framework in the domain of edge inference, focusing on the efficient use of vision transformer (ViTs) models. The partitioning strategy of conventional collaborative inference fails to reduce communication cost because of the inherent architecture of ViTs maintaining consistent layer dimensions across the entire transformer encoder. Therefore, instead of employing the partitioning strategy, our framework utilizes a lightweight ViT model on the edge device, with the server deploying a complicated ViT model. To enhance communication efficiency and achieve the classification accuracy of the server model, we propose two strategies: 1) attention-aware patch selection and 2) entropy-aware image transmission. Attention-aware patch selection leverages the attention scores generated by the edge device's transformer encoder to identify and select the image patches critical for classification. This strategy enables the edge device to transmit only the essential patches to the server, significantly improving communication efficiency. Entropy-aware image transmission uses min-entropy as a metric to accurately determine whether to depend on the lightweight model on the edge device or to request the inference from the server model. In our framework, the lightweight ViT model on the edge device acts as a semantic encoder, efficiently identifying and selecting the crucial image information required for the classification task. Our experiments demonstrate that the proposed collaborative inference framework can reduce communication overhead by 68% with only a minimal loss in accuracy compared to the server model.
- [1102] arXiv:2404.07220 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based RetrieversComments: Pre-print version of paper submitted to conferenceSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Retrieval-Augmented Generation (RAG) is a prevalent approach to infuse a private knowledge base of documents with Large Language Models (LLM) to build Generative Q\&A (Question-Answering) systems. However, RAG accuracy becomes increasingly challenging as the corpus of documents scales up, with Retrievers playing an outsized role in the overall RAG accuracy by extracting the most relevant document from the corpus to provide context to the LLM. In this paper, we propose the 'Blended RAG' method of leveraging semantic search techniques, such as Dense Vector indexes and Sparse Encoder indexes, blended with hybrid query strategies. Our study achieves better retrieval results and sets new benchmarks for IR (Information Retrieval) datasets like NQ and TREC-COVID datasets. We further extend such a 'Blended Retriever' to the RAG system to demonstrate far superior results on Generative Q\&A datasets like SQUAD, even surpassing fine-tuning performance.
- [1103] arXiv:2404.07223 (cross-list from q-fin.ST) [ pdf , ps , html , other ]
-
Title: Stock Recommendations for Individual Investors: A Temporal Graph Network Approach with Diversification-Enhancing Contrastive LearningComments: Presented at the ICAIF 2023 Workshop on Machine Learning for Investor Modelling and Recommender Systems ( this https URL )Subjects: Statistical Finance (q-fin.ST) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In complex financial markets, recommender systems can play a crucial role in empowering individuals to make informed decisions. Existing studies predominantly focus on price prediction, but even the most sophisticated models cannot accurately predict stock prices. Also, many studies show that most individual investors do not follow established investment theories because they have their own preferences. Hence, the tricky point in stock recommendation is that recommendations should give good investment performance but also should not ignore individual preferences. To develop effective stock recommender systems, it is essential to consider three key aspects: 1) individual preferences, 2) portfolio diversification, and 3) temporal aspect of both stock features and individual preferences. In response, we develop the portfolio temporal graph network recommender PfoTGNRec, which can handle time-varying collaborative signals and incorporates diversification-enhancing contrastive learning. As a result, our model demonstrated superior performance compared to various baselines, including cutting-edge dynamic embedding models and existing stock recommendation models, in a sense that our model exhibited good investment performance while maintaining competitive in capturing individual preferences. The source code and data are available at https://anonymous.4open.science/r/IJCAI2024-12F4.
- [1104] arXiv:2404.07225 (cross-list from q-fin.ST) [ pdf , ps , other ]
-
Title: Unveiling the Impact of Macroeconomic Policies: A Double Machine Learning Approach to Analyzing Interest Rate Effects on Financial MarketsSubjects: Statistical Finance (q-fin.ST) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This study examines the effects of macroeconomic policies on financial markets using a novel approach that combines Machine Learning (ML) techniques and causal inference. It focuses on the effect of interest rate changes made by the US Federal Reserve System (FRS) on the returns of fixed income and equity funds between January 1986 and December 2021. The analysis makes a distinction between actively and passively managed funds, hypothesizing that the latter are less susceptible to changes in interest rates. The study contrasts gradient boosting and linear regression models using the Double Machine Learning (DML) framework, which supports a variety of statistical learning techniques. Results indicate that gradient boosting is a useful tool for predicting fund returns; for example, a 1% increase in interest rates causes an actively managed fund's return to decrease by -11.97%. This understanding of the relationship between interest rates and fund performance provides opportunities for additional research and insightful, data-driven advice for fund managers and investors
- [1105] arXiv:2404.07229 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Personality-affected Emotion Generation in Dialog SystemsComments: Accepted by ACM Transactions on Information SystemsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Generating appropriate emotions for responses is essential for dialog systems to provide human-like interaction in various application scenarios. Most previous dialog systems tried to achieve this goal by learning empathetic manners from anonymous conversational data. However, emotional responses generated by those methods may be inconsistent, which will decrease user engagement and service quality. Psychological findings suggest that the emotional expressions of humans are rooted in personality traits. Therefore, we propose a new task, Personality-affected Emotion Generation, to generate emotion based on the personality given to the dialog system and further investigate a solution through the personality-affected mood transition. Specifically, we first construct a daily dialog dataset, Personality EmotionLines Dataset (PELD), with emotion and personality annotations. Subsequently, we analyze the challenges in this task, i.e., (1) heterogeneously integrating personality and emotional factors and (2) extracting multi-granularity emotional information in the dialog context. Finally, we propose to model the personality as the transition weight by simulating the mood transition process in the dialog system and solve the challenges above. We conduct extensive experiments on PELD for evaluation. Results suggest that by adopting our method, the emotion generation performance is improved by 13% in macro-F1 and 5% in weighted-F1 from the BERT-base model.
- [1106] arXiv:2404.07230 (cross-list from math.GM) [ pdf , ps , html , other ]
-
Title: Interval-valued fuzzy soft $\beta$-covering approximation spacesComments: 12 pagesSubjects: General Mathematics (math.GM) ; Artificial Intelligence (cs.AI)
Abstract: The concept of interval-valued fuzzy soft $\beta$-covering approximation spaces (IFS$\beta$CASs) is introduced to combine the theories of soft sets, rough sets and interval-valued fuzzy sets, and some fundamental propositions concerning interval-valued fuzzy soft $\beta$-neighborhoods and soft $\beta$-neighborhoods of IFS$\beta$CASs are explored. And then four kinds of interval-valued fuzzy soft $\beta$-coverings based fuzzy rough sets are researched. Finally, the relationships of four kinds of interval-valued fuzzy soft $\beta$-coverings based fuzzy rough sets are investigated.
- [1107] arXiv:2404.07234 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Goal-guided Generative Prompt Injection Attack on Large Language ModelsComments: 22 pages, 8 figuresSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Current large language models (LLMs) provide a strong foundation for large-scale user-oriented natural language tasks. A large number of users can easily inject adversarial text or instructions through the user interface, thus causing LLMs model security challenges. Although there is currently a large amount of research on prompt injection attacks, most of these black-box attacks use heuristic strategies. It is unclear how these heuristic strategies relate to the success rate of attacks and thus effectively improve model robustness. To solve this problem, we redefine the goal of the attack: to maximize the KL divergence between the conditional probabilities of the clean text and the adversarial text. Furthermore, we prove that maximizing the KL divergence is equivalent to maximizing the Mahalanobis distance between the embedded representation $x$ and $x'$ of the clean text and the adversarial text when the conditional probability is a Gaussian distribution and gives a quantitative relationship on $x$ and $x'$. Then we designed a simple and effective goal-guided generative prompt injection strategy (G2PIA) to find an injection text that satisfies specific constraints to achieve the optimal attack effect approximately. It is particularly noteworthy that our attack method is a query-free black-box attack method with low computational cost. Experimental results on seven LLM models and four datasets show the effectiveness of our attack method.
- [1108] arXiv:2404.07235 (cross-list from cs.AR) [ pdf , ps , html , other ]
-
Title: Explaining EDA synthesis errors with LLMsComments: 6 pages, 6 figuresSubjects: Hardware Architecture (cs.AR) ; Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Abstract: Training new engineers in digital design is a challenge, particularly when it comes to teaching the complex electronic design automation (EDA) tooling used in this domain. Learners will typically deploy designs in the Verilog and VHDL hardware description languages to Field Programmable Gate Arrays (FPGAs) from Altera (Intel) and Xilinx (AMD) via proprietary closed-source toolchains (Quartus Prime and Vivado, respectively). These tools are complex and difficult to use -- yet, as they are the tools used in industry, they are an essential first step in this space. In this work, we examine how recent advances in artificial intelligence may be leveraged to address aspects of this challenge. Specifically, we investigate if Large Language Models (LLMs), which have demonstrated text comprehension and question-answering capabilities, can be used to generate novice-friendly explanations of compile-time synthesis error messages from Quartus Prime and Vivado. To perform this study we generate 936 error message explanations using three OpenAI LLMs over 21 different buggy code samples. These are then graded for relevance and correctness, and we find that in approximately 71% of cases the LLMs give correct & complete explanations suitable for novice learners.
- [1109] arXiv:2404.07239 (cross-list from q-bio.QM) [ pdf , ps , other ]
-
Title: Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer DiagnosisMilad Yousefi , Shadi Farabi Maleki , Ali Jafarizadeh , Mahya Ahmadpour Youshanlui , Aida Jafari , Siamak Pedrammehr , Roohallah Alizadehsani , Ryszard Tadeusiewicz , Pawel PlawiakComments: 50 pages, 8 figures, 1 table, 119 referencesSubjects: Quantitative Methods (q-bio.QM) ; Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Abstract: Thyroid cancer is an increasing global health concern that requires advanced diagnostic methods. The application of AI and radiomics to thyroid cancer diagnosis is examined in this review. A review of multiple databases was conducted in compliance with PRISMA guidelines until October 2023. A combination of keywords led to the discovery of an English academic publication on thyroid cancer and related subjects. 267 papers were returned from the original search after 109 duplicates were removed. Relevant studies were selected according to predetermined criteria after 124 articles were eliminated based on an examination of their abstract and title. After the comprehensive analysis, an additional six studies were excluded. Among the 28 included studies, radiomics analysis, which incorporates ultrasound (US) images, demonstrated its effectiveness in diagnosing thyroid cancer. Various results were noted, some of the studies presenting new strategies that outperformed the status quo. The literature has emphasized various challenges faced by AI models, including interpretability issues, dataset constraints, and operator dependence. The synthesized findings of the 28 included studies mentioned the need for standardization efforts and prospective multicenter studies to address these concerns. Furthermore, approaches to overcome these obstacles were identified, such as advances in explainable AI technology and personalized medicine techniques. The review focuses on how AI and radiomics could transform the diagnosis and treatment of thyroid cancer. Despite challenges, future research on multidisciplinary cooperation, clinical applicability validation, and algorithm improvement holds the potential to improve patient outcomes and diagnostic precision in the treatment of thyroid cancer.
- [1110] arXiv:2404.07242 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Sandwich attack: Multi-language Mixture Adaptive Attack on LLMsSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large Language Models (LLMs) are increasingly being developed and applied, but their widespread use faces challenges. These include aligning LLMs' responses with human values to prevent harmful outputs, which is addressed through safety training methods. Even so, bad actors and malicious users have succeeded in attempts to manipulate the LLMs to generate misaligned responses for harmful questions such as methods to create a bomb in school labs, recipes for harmful drugs, and ways to evade privacy rights. Another challenge is the multilingual capabilities of LLMs, which enable the model to understand and respond in multiple languages. Consequently, attackers exploit the unbalanced pre-training datasets of LLMs in different languages and the comparatively lower model performance in low-resource languages than high-resource ones. As a result, attackers use a low-resource languages to intentionally manipulate the model to create harmful responses. Many of the similar attack vectors have been patched by model providers, making the LLMs more robust against language-based manipulation. In this paper, we introduce a new black-box attack vector called the \emph{Sandwich attack}: a multi-language mixture attack, which manipulates state-of-the-art LLMs into generating harmful and misaligned responses. Our experiments with five different models, namely Google's Bard, Gemini Pro, LLaMA-2-70-B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-OPUS, show that this attack vector can be used by adversaries to generate harmful responses and elicit misaligned responses from these models. By detailing both the mechanism and impact of the Sandwich attack, this paper aims to guide future research and development towards more secure and resilient LLMs, ensuring they serve the public good while minimizing potential for misuse.
- [1111] arXiv:2404.07245 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Generative Resident Separation and Multi-label Classification for Multi-person Activity RecognitionComments: Context and Activity Modeling and Recognition (CoMoReA) Workshop at IEEE International Conference on Pervasive Computing and Communications (PerCom 2024), Mar 2024, Biarritz, FranceSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: This paper presents two models to address the problem of multi-person activity recognition using ambient sensors in a home. The first model, Seq2Res, uses a sequence generation approach to separate sensor events from different residents. The second model, BiGRU+Q2L, uses a Query2Label multi-label classifier to predict multiple activities simultaneously. Performances of these models are compared to a state-of-the-art model in different experimental scenarios, using a state-of-the-art dataset of two residents in a home instrumented with ambient sensors. These results lead to a discussion on the advantages and drawbacks of resident separation and multi-label classification for multi-person activity recognition.
- [1112] arXiv:2404.07306 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: AI-Guided Defect Detection Techniques to Model Single Crystal Diamond GrowthRohan Reddy Mekala , Elias Garratt , Matthias Muehle , Arjun Srinivasan , Adam Porter , Mikael LindvallComments: 12 pages,4 figures,ACMME 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: From a process development perspective, diamond growth via chemical vapor deposition has made significant strides. However, challenges persist in achieving high quality and large-area material production. These difficulties include controlling conditions to maintain uniform growth rates for the entire growth surface. As growth progresses, various factors or defect states emerge, altering the uniform conditions. These changes affect the growth rate and result in the formation of crystalline defects at the microscale. However, there is a distinct lack of methods to identify these defect states and their geometry using images taken during the growth process. This paper details seminal work on defect segmentation pipeline using in-situ optical images to identify features that indicate defective states that are visible at the macroscale. Using a semantic segmentation approach as applied in our previous work, these defect states and corresponding derivative features are isolated and classified by their pixel masks. Using an annotation focused human-in-the-loop software architecture to produce training datasets, with modules for selective data labeling using active learning, data augmentations, and model-assisted labeling, our approach achieves effective annotation accuracy and drastically reduces the time and cost of labeling by orders of magnitude. On the model development front, we found that deep learning-based algorithms are the most efficient. They can accurately learn complex representations from feature-rich datasets. Our best-performing model, based on the YOLOV3 and DeeplabV3plus architectures, achieved excellent accuracy for specific features of interest. Specifically, it reached 93.35% accuracy for center defects, 92.83% for polycrystalline defects, and 91.98% for edge defects.
- [1113] arXiv:2404.07315 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Structured Reinforcement Learning for Media Streaming at the Wireless EdgeArchana Bura , Sarat Chandra Bobbili , Shreyas Rameshkumar , Desik Rengarajan , Dileep Kalathil , Srinivas ShakkottaiComments: 15 pages, 14 figuresSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Media streaming is the dominant application over wireless edge (access) networks. The increasing softwarization of such networks has led to efforts at intelligent control, wherein application-specific actions may be dynamically taken to enhance the user experience. The goal of this work is to develop and demonstrate learning-based policies for optimal decision making to determine which clients to dynamically prioritize in a video streaming setting. We formulate the policy design question as a constrained Markov decision problem (CMDP), and observe that by using a Lagrangian relaxation we can decompose it into single-client problems. Further, the optimal policy takes a threshold form in the video buffer length, which enables us to design an efficient constrained reinforcement learning (CRL) algorithm to learn it. Specifically, we show that a natural policy gradient (NPG) based algorithm that is derived using the structure of our problem converges to the globally optimal policy. We then develop a simulation environment for training, and a real-world intelligent controller attached to a WiFi access point for evaluation. We empirically show that the structured learning approach enables fast learning. Furthermore, such a structured policy can be easily deployed due to low computational complexity, leading to policy execution taking only about 15$\mu$s. Using YouTube streaming experiments in a resource constrained scenario, we demonstrate that the CRL approach can increase quality of experience (QOE) by over 30\%.
- [1114] arXiv:2404.07344 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Interactive Learning of Physical Object Properties Through Robot Manipulation and Database of Object MeasurementsAndrej Kruzliak , Jiri Hartvich , Shubhan P. Patni , Lukas Rustler , Jan Kristof Behrens , Fares J. Abu-Dakka , Krystian Mikolajczyk , Ville Kyrki , Matej HoffmannComments: 8 pages, 8 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Abstract: This work presents a framework for automatically extracting physical object properties, such as material composition, mass, volume, and stiffness, through robot manipulation and a database of object measurements. The framework involves exploratory action selection to maximize learning about objects on a table. A Bayesian network models conditional dependencies between object properties, incorporating prior probability distributions and uncertainty associated with measurement actions. The algorithm selects optimal exploratory actions based on expected information gain and updates object properties through Bayesian inference. Experimental evaluation demonstrates effective action selection compared to a baseline and correct termination of the experiments if there is nothing more to be learned. The algorithm proved to behave intelligently when presented with trick objects with material properties in conflict with their appearance. The robot pipeline integrates with a logging module and an online database of objects, containing over 24,000 measurements of 63 objects with different grippers. All code and data are publicly available, facilitating automatic digitization of objects and their physical properties through exploratory manipulations.
- [1115] arXiv:2404.07353 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Addressing the Abstraction and Reasoning Corpus via Procedural Example GenerationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This work presents code to procedurally generate examples for the ARC training tasks. For each of the 400 tasks, an example generator following the transformation logic of the original examples was created. In effect, the assumed underlying distribution of examples for any given task was reverse engineered by implementing a means to sample from it. An attempt was made to cover an as large as reasonable space of possible examples for each task. That is, whenever the original examples of a given task may be limited in their diversity e.g. by having the dimensions of the grids, the set of symbols or number of objects constant or within tight bounds, even though the transformation does not require it, such constraints were lifted. Having access to not just a few examples per task, as the case for ARC, but instead very many, should enable a wide range of experiments that may be important stepping stones towards making leaps on the benchmark.
- [1116] arXiv:2404.07356 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: GANsemble for Small and Imbalanced Data Sets: A Baseline for Synthetic Microplastics DataComments: Accepted to the 37th Canadian Artificial Intelligence Conference (2024), 12 pages, 4 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Microplastic particle ingestion or inhalation by humans is a problem of growing concern. Unfortunately, current research methods that use machine learning to understand their potential harms are obstructed by a lack of available data. Deep learning techniques in particular are challenged by such domains where only small or imbalanced data sets are available. Overcoming this challenge often involves oversampling underrepresented classes or augmenting the existing data to improve model performance. This paper proposes GANsemble: a two-module framework connecting data augmentation with conditional generative adversarial networks (cGANs) to generate class-conditioned synthetic data. First, the data chooser module automates augmentation strategy selection by searching for the best data augmentation strategy. Next, the cGAN module uses this strategy to train a cGAN for generating enhanced synthetic data. We experiment with the GANsemble framework on a small and imbalanced microplastics data set. A Microplastic-cGAN (MPcGAN) algorithm is introduced, and baselines for synthetic microplastics (SYMP) data are established in terms of Frechet Inception Distance (FID) and Inception Scores (IS). We also provide a synthetic microplastics filter (SYMP-Filter) algorithm to increase the quality of generated SYMP. Additionally, we show the best amount of oversampling with augmentation to fix class imbalance in small microplastics data sets. To our knowledge, this study is the first application of generative AI to synthetically create microplastics data.
- [1117] arXiv:2404.07366 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Differentially Private GANs for Generating Synthetic Indoor Location DataComments: Submitted to International Journal of Information SecuritySubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: The advent of location-based services has led to the widespread adoption of indoor localization systems, which enable location tracking of individuals within enclosed spaces such as buildings. While these systems provide numerous benefits such as improved security and personalized services, they also raise concerns regarding privacy violations. As such, there is a growing need for privacy-preserving solutions that can protect users' sensitive location information while still enabling the functionality of indoor localization systems. In recent years, Differentially Private Generative Adversarial Networks (DPGANs) have emerged as a powerful methodology that aims to protect the privacy of individual data points while generating realistic synthetic data similar to original data. DPGANs combine the power of generative adversarial networks (GANs) with the privacy-preserving technique of differential privacy (DP). In this paper, we introduce an indoor localization framework employing DPGANs in order to generate privacy-preserving indoor location data. We evaluate the performance of our framework on a real-world indoor localization dataset and demonstrate its effectiveness in preserving privacy while maintaining the accuracy of the localization system.
- [1118] arXiv:2404.07377 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Deep Generative Sampling in the Dual Divergence Space: A Data-efficient & Interpretative Approach for Generative AISahil Garg , Anderson Schneider , Anant Raj , Kashif Rasul , Yuriy Nevmyvaka , Sneihil Gopal , Amit Dhurandhar , Guillermo Cecchi , Irina RishSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
Abstract: Building on the remarkable achievements in generative sampling of natural images, we propose an innovative challenge, potentially overly ambitious, which involves generating samples of entire multivariate time series that resemble images. However, the statistical challenge lies in the small sample size, sometimes consisting of a few hundred subjects. This issue is especially problematic for deep generative models that follow the conventional approach of generating samples from a canonical distribution and then decoding or denoising them to match the true data distribution. In contrast, our method is grounded in information theory and aims to implicitly characterize the distribution of images, particularly the (global and local) dependency structure between pixels. We achieve this by empirically estimating its KL-divergence in the dual form with respect to the respective marginal distribution. This enables us to perform generative sampling directly in the optimized 1-D dual divergence space. Specifically, in the dual space, training samples representing the data distribution are embedded in the form of various clusters between two end points. In theory, any sample embedded between those two end points is in-distribution w.r.t. the data distribution. Our key idea for generating novel samples of images is to interpolate between the clusters via a walk as per gradients of the dual function w.r.t. the data dimensions. In addition to the data efficiency gained from direct sampling, we propose an algorithm that offers a significant reduction in sample complexity for estimating the divergence of the data distribution with respect to the marginal distribution. We provide strong theoretical guarantees along with an extensive empirical evaluation using many real-world datasets from diverse domains, establishing the superiority of our approach w.r.t. state-of-the-art deep learning methods.
- [1119] arXiv:2404.07383 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Incorporating Explanations into Human-Machine Interfaces for Trust and Situation Awareness in Autonomous VehiclesComments: Accepted to IEEE IV-2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Autonomous vehicles often make complex decisions via machine learning-based predictive models applied to collected sensor data. While this combination of methods provides a foundation for real-time actions, self-driving behavior primarily remains opaque to end users. In this sense, explainability of real-time decisions is a crucial and natural requirement for building trust in autonomous vehicles. Moreover, as autonomous vehicles still cause serious traffic accidents for various reasons, timely conveyance of upcoming hazards to road users can help improve scene understanding and prevent potential risks. Hence, there is also a need to supply autonomous vehicles with user-friendly interfaces for effective human-machine teaming. Motivated by this problem, we study the role of explainable AI and human-machine interface jointly in building trust in vehicle autonomy. We first present a broad context of the explanatory human-machine systems with the "3W1H" (what, whom, when, how) approach. Based on these findings, we present a situation awareness framework for calibrating users' trust in self-driving behavior. Finally, we perform an experiment on our framework, conduct a user study on it, and validate the empirical findings with hypothesis testing.
- [1120] arXiv:2404.07387 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: BISCUIT: Scaffolding LLM-Generated Code with Ephemeral UIs in Computational NotebooksSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Novices frequently engage with machine learning tutorials in computational notebooks and have been adopting code generation technologies based on large language models (LLMs). However, they encounter difficulties in understanding and working with code produced by LLMs. To mitigate these challenges, we introduce a novel workflow into computational notebooks that augments LLM-based code generation with an additional ephemeral UI step, offering users UI-based scaffolds as an intermediate stage between user prompts and code generation. We present this workflow in BISCUIT, an extension for JupyterLab that provides users with ephemeral UIs generated by LLMs based on the context of their code and intentions, scaffolding users to understand, guide, and explore with LLM-generated code. Through a user study where 10 novices used BISCUIT for machine learning tutorials, we discover that BISCUIT offers user semantic representation of code to aid their understanding, reduces the complexity of prompt engineering, and creates a playground for users to explore different variables and iterate on their ideas. We discuss the implications of our findings for UI-centric interactive paradigm in code generation LLMs.
- [1121] arXiv:2404.07396 (cross-list from econ.GN) [ pdf , ps , html , other ]
-
Title: ChatGPT Can Predict the Future when it Tells Stories Set in the Future About the PastComments: 61 pages, 26 figures; corrected typosSubjects: General Economics (econ.GN) ; Artificial Intelligence (cs.AI)
Abstract: This study investigates whether OpenAI's ChatGPT-3.5 and ChatGPT-4 can accurately forecast future events using two distinct prompting strategies. To evaluate the accuracy of the predictions, we take advantage of the fact that the training data at the time of experiment stopped at September 2021, and ask about events that happened in 2022 using ChatGPT-3.5 and ChatGPT-4. We employed two prompting strategies: direct prediction and what we call future narratives which ask ChatGPT to tell fictional stories set in the future with characters that share events that have happened to them, but after ChatGPT's training data had been collected. Concentrating on events in 2022, we prompted ChatGPT to engage in storytelling, particularly within economic contexts. After analyzing 100 prompts, we discovered that future narrative prompts significantly enhanced ChatGPT-4's forecasting accuracy. This was especially evident in its predictions of major Academy Award winners as well as economic trends, the latter inferred from scenarios where the model impersonated public figures like the Federal Reserve Chair, Jerome Powell. These findings indicate that narrative prompts leverage the models' capacity for hallucinatory narrative construction, facilitating more effective data synthesis and extrapolation than straightforward predictions. Our research reveals new aspects of LLMs' predictive capabilities and suggests potential future applications in analytical contexts.
- [1122] arXiv:2404.07413 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: JetMoE: Reaching Llama2 Performance with 0.1M DollarsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have achieved remarkable results, but their increasing resource demand has become a major obstacle to the development of powerful and accessible super-human intelligence. This report introduces JetMoE-8B, a new LLM trained with less than $0.1 million, using 1.25T tokens from carefully mixed open-source corpora and 30,000 H100 GPU hours. Despite its low cost, the JetMoE-8B demonstrates impressive performance, with JetMoE-8B outperforming the Llama2-7B model and JetMoE-8B-Chat surpassing the Llama2-13B-Chat model. These results suggest that LLM training can be much more cost-effective than generally thought. JetMoE-8B is based on an efficient Sparsely-gated Mixture-of-Experts (SMoE) architecture, composed of attention and feedforward experts. Both layers are sparsely activated, allowing JetMoE-8B to have 8B parameters while only activating 2B for each input token, reducing inference computation by about 70% compared to Llama2-7B. Moreover, JetMoE-8B is highly open and academia-friendly, using only public datasets and training code. All training parameters and data mixtures have been detailed in this report to facilitate future efforts in the development of open foundation models. This transparency aims to encourage collaboration and further advancements in the field of accessible and efficient LLMs. The model weights are publicly available at this https URL .
- [1123] arXiv:2404.07434 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Data-Driven Portfolio Management for Motion Pictures Industry: A New Data-Driven Optimization Methodology Using a Large Language Model as the ExpertSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract: Portfolio management is one of the unresponded problems of the Motion Pictures Industry (MPI). To design an optimal portfolio for an MPI distributor, it is essential to predict the box office of each project. Moreover, for an accurate box office prediction, it is critical to consider the effect of the celebrities involved in each MPI project, which was impossible with any precedent expert-based method. Additionally, the asymmetric characteristic of MPI data decreases the performance of any predictive algorithm. In this paper, firstly, the fame score of the celebrities is determined using a large language model. Then, to tackle the asymmetric character of MPI's data, projects are classified. Furthermore, the box office prediction takes place for each class of projects. Finally, using a hybrid multi-attribute decision-making technique, the preferability of each project for the distributor is calculated, and benefiting from a bi-objective optimization model, the optimal portfolio is designed.
- [1124] arXiv:2404.07446 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Graph Attention Network for Lane-Wise and Topology-Invariant Intersection Traffic SimulationComments: T-TIS Journal, 12 pages, 8 figures, 4 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Traffic congestion has significant economic, environmental, and social ramifications. Intersection traffic flow dynamics are influenced by numerous factors. While microscopic traffic simulators are valuable tools, they are computationally intensive and challenging to calibrate. Moreover, existing machine-learning approaches struggle to provide lane-specific waveforms or adapt to intersection topology and traffic patterns. In this study, we propose two efficient and accurate "Digital Twin" models for intersections, leveraging Graph Attention Neural Networks (GAT). These attentional graph auto-encoder digital twins capture temporal, spatial, and contextual aspects of traffic within intersections, incorporating various influential factors such as high-resolution loop detector waveforms, signal state records, driving behaviors, and turning-movement counts. Trained on diverse counterfactual scenarios across multiple intersections, our models generalize well, enabling the estimation of detailed traffic waveforms for any intersection approach and exit lanes. Multi-scale error metrics demonstrate that our models perform comparably to microsimulations. The primary application of our study lies in traffic signal optimization, a pivotal area in transportation systems research. These lightweight digital twins can seamlessly integrate into corridor and network signal timing optimization frameworks. Furthermore, our study's applications extend to lane reconfiguration, driving behavior analysis, and facilitating informed decisions regarding intersection safety and efficiency enhancements. A promising avenue for future research involves extending this approach to urban freeway corridors and integrating it with measures of effectiveness metrics.
- [1125] arXiv:2404.07452 (cross-list from q-fin.RM) [ pdf , ps , html , other ]
-
Title: RiskLabs: Predicting Financial Risk Using Large Language Model Based on Multi-Sources DataYupeng Cao , Zhi Chen , Qingyun Pei , Fabrizio Dimino , Lorenzo Ausiello , Prashant Kumar , K.P. Subbalakshmi , Papa Momar NdiayeComments: 24 pages, 7 figures, 5 tables, 1 algorithmSubjects: Risk Management (q-fin.RM) ; Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Portfolio Management (q-fin.PM)
Abstract: The integration of Artificial Intelligence (AI) techniques, particularly large language models (LLMs), in finance has garnered increasing academic attention. Despite progress, existing studies predominantly focus on tasks like financial text summarization, question-answering (Q$\&$A), and stock movement prediction (binary classification), with a notable gap in the application of LLMs for financial risk prediction. Addressing this gap, in this paper, we introduce \textbf{RiskLabs}, a novel framework that leverages LLMs to analyze and predict financial risks. RiskLabs uniquely combines different types of financial data, including textual and vocal information from Earnings Conference Calls (ECCs), market-related time series data, and contextual news data surrounding ECC release dates. Our approach involves a multi-stage process: initially extracting and analyzing ECC data using LLMs, followed by gathering and processing time-series data before the ECC dates to model and understand risk over different timeframes. Using multimodal fusion techniques, RiskLabs amalgamates these varied data features for comprehensive multi-task financial risk prediction. Empirical experiment results demonstrate RiskLab's effectiveness in forecasting both volatility and variance in financial markets. Through comparative experiments, we demonstrate how different data sources contribute to financial risk assessment and discuss the critical role of LLMs in this context. Our findings not only contribute to the AI in finance application but also open new avenues for applying LLMs in financial risk assessment.
- [1126] arXiv:2404.07461 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: "Confidently Nonsensical?'': A Critical Survey on the Perspectives and Challenges of 'Hallucinations' in NLPPranav Narayanan Venkit , Tatiana Chakravorti , Vipul Gupta , Heidi Biggs , Mukund Srinath , Koustava Goswami , Sarah Rajtmajer , Shomir WilsonSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We investigate how hallucination in large language models (LLM) is characterized in peer-reviewed literature using a critical examination of 103 publications across NLP research. Through a comprehensive review of sociological and technological literature, we identify a lack of agreement with the term `hallucination.' Additionally, we conduct a survey with 171 practitioners from the field of NLP and AI to capture varying perspectives on hallucination. Our analysis underscores the necessity for explicit definitions and frameworks outlining hallucination within NLP, highlighting potential challenges, and our survey inputs provide a thematic understanding of the influence and ramifications of hallucination in society.
- [1127] arXiv:2404.07471 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Structure-aware Fine-tuning for Code Pre-trained ModelsComments: Accepted by COLING 2024Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Over the past few years, we have witnessed remarkable advancements in Code Pre-trained Models (CodePTMs). These models achieved excellent representation capabilities by designing structure-based pre-training tasks for code. However, how to enhance the absorption of structural knowledge when fine-tuning CodePTMs still remains a significant challenge. To fill this gap, in this paper, we present Structure-aware Fine-tuning (SAT), a novel structure-enhanced and plug-and-play fine-tuning method for CodePTMs. We first propose a structure loss to quantify the difference between the information learned by CodePTMs and the knowledge extracted from code structure. Specifically, we use the attention scores extracted from Transformer layer as the learned structural information, and the shortest path length between leaves in abstract syntax trees as the structural knowledge. Subsequently, multi-task learning is introduced to improve the performance of fine-tuning. Experiments conducted on four pre-trained models and two generation tasks demonstrate the effectiveness of our proposed method as a plug-and-play solution. Furthermore, we observed that SAT can benefit CodePTMs more with limited training data.
- [1128] arXiv:2404.07475 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Laissez-Faire Harms: Algorithmic Biases in Generative Language ModelsComments: 16 pages (43 if including supplementals), 8 figures (23 if including supplementals)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: The rapid deployment of generative language models (LMs) has raised concerns about social biases affecting the well-being of diverse consumers. The extant literature on generative LMs has primarily examined bias via explicit identity prompting. However, prior research on bias in earlier language-based technology platforms, including search engines, has shown that discrimination can occur even when identity terms are not specified explicitly. Studies of bias in LM responses to open-ended prompts (where identity classifications are left unspecified) are lacking and have not yet been grounded in end-consumer harms. Here, we advance studies of generative LM bias by considering a broader set of natural use cases via open-ended prompting. In this "laissez-faire" setting, we find that synthetically generated texts from five of the most pervasive LMs (ChatGPT3.5, ChatGPT4, Claude2.0, Llama2, and PaLM2) perpetuate harms of omission, subordination, and stereotyping for minoritized individuals with intersectional race, gender, and/or sexual orientation identities (AI/AN, Asian, Black, Latine, MENA, NH/PI, Female, Non-binary, Queer). We find widespread evidence of bias to an extent that such individuals are hundreds to thousands of times more likely to encounter LM-generated outputs that portray their identities in a subordinated manner compared to representative or empowering portrayals. We also document a prevalence of stereotypes (e.g. perpetual foreigner) in LM-generated outputs that are known to trigger psychological harms that disproportionately affect minoritized individuals. These include stereotype threat, which leads to impaired cognitive performance and increased negative self-perception. Our findings highlight the urgent need to protect consumers from discriminatory harms caused by language models and invest in critical AI education programs tailored towards empowering diverse consumers.
- [1129] arXiv:2404.07484 (cross-list from cs.MM) [ pdf , ps , other ]
-
Title: Multimodal Emotion Recognition by Fusing Video Semantic in MOOC Learning ScenariosSubjects: Multimedia (cs.MM) ; Artificial Intelligence (cs.AI)
Abstract: In the Massive Open Online Courses (MOOC) learning scenario, the semantic information of instructional videos has a crucial impact on learners' emotional state. Learners mainly acquire knowledge by watching instructional videos, and the semantic information in the videos directly affects learners' emotional states. However, few studies have paid attention to the potential influence of the semantic information of instructional videos on learners' emotional states. To deeply explore the impact of video semantic information on learners' emotions, this paper innovatively proposes a multimodal emotion recognition method by fusing video semantic information and physiological signals. We generate video descriptions through a pre-trained large language model (LLM) to obtain high-level semantic information about instructional videos. Using the cross-attention mechanism for modal interaction, the semantic information is fused with the eye movement and PhotoPlethysmoGraphy (PPG) signals to obtain the features containing the critical information of the three modes. The accurate recognition of learners' emotional states is realized through the emotion classifier. The experimental results show that our method has significantly improved emotion recognition performance, providing a new perspective and efficient method for emotion recognition research in MOOC learning scenarios. The method proposed in this paper not only contributes to a deeper understanding of the impact of instructional videos on learners' emotional states but also provides a beneficial reference for future research on emotion recognition in MOOC learning scenarios.
- [1130] arXiv:2404.07493 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Characterizing the Influence of Topology on Graph Learning TasksSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graph neural networks (GNN) have achieved remarkable success in a wide range of tasks by encoding features combined with topology to create effective representations. However, the fundamental problem of understanding and analyzing how graph topology influences the performance of learning models on downstream tasks has not yet been well understood. In this paper, we propose a metric, TopoInf, which characterizes the influence of graph topology by measuring the level of compatibility between the topological information of graph data and downstream task objectives. We provide analysis based on the decoupled GNNs on the contextual stochastic block model to demonstrate the effectiveness of the metric. Through extensive experiments, we demonstrate that TopoInf is an effective metric for measuring topological influence on corresponding tasks and can be further leveraged to enhance graph learning.
- [1131] arXiv:2404.07498 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Interactive Prompt Debugging with Sequence SalienceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: We present Sequence Salience, a visual tool for interactive prompt debugging with input salience methods. Sequence Salience builds on widely used salience methods for text classification and single-token prediction, and extends this to a system tailored for debugging complex LLM prompts. Our system is well-suited for long texts, and expands on previous work by 1) providing controllable aggregation of token-level salience to the word, sentence, or paragraph level, making salience over long inputs tractable; and 2) supporting rapid iteration where practitioners can act on salience results, refine prompts, and run salience on the new output. We include case studies showing how Sequence Salience can help practitioners work with several complex prompting strategies, including few-shot, chain-of-thought, and constitutional principles. Sequence Salience is built on the Learning Interpretability Tool, an open-source platform for ML model visualizations, and code, notebooks, and tutorials are available at http://goo.gle/sequence-salience.
- [1132] arXiv:2404.07502 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Generating Counterfactual Explanations Using Cardinality ConstraintsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Providing explanations about how machine learning algorithms work and/or make particular predictions is one of the main tools that can be used to improve their trusworthiness, fairness and robustness. Among the most intuitive type of explanations are counterfactuals, which are examples that differ from a given point only in the prediction target and some set of features, presenting which features need to be changed in the original example to flip the prediction for that example. However, such counterfactuals can have many different features than the original example, making their interpretation difficult. In this paper, we propose to explicitly add a cardinality constraint to counterfactual generation limiting how many features can be different from the original example, thus providing more interpretable and easily understantable counterfactuals.
- [1133] arXiv:2404.07504 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object ExchangeSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In the realm of point cloud scene understanding, particularly in indoor scenes, objects are arranged following human habits, resulting in objects of certain semantics being closely positioned and displaying notable inter-object correlations. This can create a tendency for neural networks to exploit these strong dependencies, bypassing the individual object patterns. To address this challenge, we introduce a novel self-supervised learning (SSL) strategy. Our approach leverages both object patterns and contextual cues to produce robust features. It begins with the formulation of an object-exchanging strategy, where pairs of objects with comparable sizes are exchanged across different scenes, effectively disentangling the strong contextual dependencies. Subsequently, we introduce a context-aware feature learning strategy, which encodes object patterns without relying on their specific context by aggregating object features across various scenes. Our extensive experiments demonstrate the superiority of our method over existing SSL techniques, further showing its better robustness to environmental changes. Moreover, we showcase the applicability of our approach by transferring pre-trained models to diverse point cloud datasets.
- [1134] arXiv:2404.07519 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: LATTE: Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient TransformerSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI)
Abstract: With the rise of Transformer models in NLP and CV domain, Multi-Head Attention has been proven to be a game-changer. However, its expensive computation poses challenges to the model throughput and efficiency, especially for the long sequence tasks. Exploiting the sparsity in attention has been proven to be an effective way to reduce computation. Nevertheless, prior works do not consider the various distributions among different heads and lack a systematic method to determine the threshold. To address these challenges, we propose Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer (LATTE). LATTE employs a headwise threshold-based filter with the low-precision dot product and computation reuse mechanism to reduce the computation of MHA. Moreover, the trainable threshold is introduced to provide a systematic method for adjusting the thresholds and enable end-to-end optimization. Experimental results indicate LATTE can smoothly adapt to both NLP and CV tasks, offering significant computation savings with only a minor compromise in performance. Also, the trainable threshold is shown to be essential for the leverage between the performance and the computation. As a result, LATTE filters up to 85.16% keys with only a 0.87% accuracy drop in the CV task and 89.91% keys with a 0.86 perplexity increase in the NLP task.
- [1135] arXiv:2404.07532 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Bayesian Federated Model Compression for Communication and Computation EfficiencySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: In this paper, we investigate Bayesian model compression in federated learning (FL) to construct sparse models that can achieve both communication and computation efficiencies. We propose a decentralized Turbo variational Bayesian inference (D-Turbo-VBI) FL framework where we firstly propose a hierarchical sparse prior to promote a clustered sparse structure in the weight matrix. Then, by carefully integrating message passing and VBI with a decentralized turbo framework, we propose the D-Turbo-VBI algorithm which can (i) reduce both upstream and downstream communication overhead during federated training, and (ii) reduce the computational complexity during local inference. Additionally, we establish the convergence property for thr proposed D-Turbo-VBI algorithm. Simulation results show the significant gain of our proposed algorithm over the baselines in reducing communication overhead during federated training and computational complexity of final model.
- [1136] arXiv:2404.07533 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: IITP-VDLand: A Comprehensive Dataset on Decentraland ParcelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Abstract: This paper presents IITP-VDLand, a comprehensive dataset of Decentraland parcels sourced from diverse platforms. Unlike existing datasets which have limited attributes and records, IITP-VDLand offers a rich array of attributes, encompassing parcel characteristics, trading history, past activities, transactions, and social media interactions. Alongside, we introduce a key attribute in the dataset, namely Rarity score, which measures the uniqueness of each parcel within the virtual world. Addressing the significant challenge posed by the dispersed nature of this data across various sources, we employ a systematic approach, utilizing both available APIs and custom scripts, to gather it. Subsequently, we meticulously curate and organize the information into four distinct segments: (1) Characteristics Data-Fragment, (2) OpenSea Trading History Data-Fragment, (3) Ethereum Activity Transactions Data-Fragment, and (4) Social Media Data-Fragment. We envisage that this dataset would serve as a robust resource for training machine- and deep-learning models specifically designed to address real-world challenges within the domain of Decentraland parcels. The performance benchmarking of more than 20 state-of-the-art price prediction models on our dataset yields promising results, achieving a maximum R2 score of 0.8251 and an accuracy of 74.23% in case of Extra Trees Regressor and Classifier. The key findings reveal that the ensemble models performs better than both deep learning and linear models for our dataset. We observe a significant impact of coordinates, geographical proximity, rarity score, and few other economic indicators on the prediction of parcel prices.
- [1137] arXiv:2404.07544 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context ExamplesComments: 50 pages, 48 figures, preprint; Fixed typosSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We analyze how well pre-trained large language models (e.g., Llama2, GPT-4, Claude 3, etc) can do linear and non-linear regression when given in-context examples, without any additional training or gradient updates. Our findings reveal that several large language models (e.g., GPT-4, Claude 3) are able to perform regression tasks with a performance rivaling (or even outperforming) that of traditional supervised methods such as Random Forest, Bagging, or Gradient Boosting. For example, on the challenging Friedman #2 regression dataset, Claude 3 outperforms many supervised methods such as AdaBoost, SVM, Random Forest, KNN, or Gradient Boosting. We then investigate how well the performance of large language models scales with the number of in-context exemplars. We borrow from the notion of regret from online learning and empirically show that LLMs are capable of obtaining a sub-linear regret.
- [1138] arXiv:2404.07554 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CAT: Contrastive Adapter Training for Personalized Image GenerationComments: CVPRW 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The emergence of various adapters, including Low-Rank Adaptation (LoRA) applied from the field of natural language processing, has allowed diffusion models to personalize image generation at a low cost. However, due to the various challenges including limited datasets and shortage of regularization and computation resources, adapter training often results in unsatisfactory outcomes, leading to the corruption of the backbone model's prior knowledge. One of the well known phenomena is the loss of diversity in object generation, especially within the same class which leads to generating almost identical objects with minor variations. This poses challenges in generation capabilities. To solve this issue, we present Contrastive Adapter Training (CAT), a simple yet effective strategy to enhance adapter training through the application of CAT loss. Our approach facilitates the preservation of the base model's original knowledge when the model initiates adapters. Furthermore, we introduce the Knowledge Preservation Score (KPS) to evaluate CAT's ability to keep the former information. We qualitatively and quantitatively compare CAT's improvement. Finally, we mention the possibility of CAT in the aspects of multi-concept adapter and optimization.
- [1139] arXiv:2404.07559 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Differentially Private Reinforcement Learning with Self-PlayComments: 32 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
Abstract: We study the problem of multi-agent reinforcement learning (multi-agent RL) with differential privacy (DP) constraints. This is well-motivated by various real-world applications involving sensitive data, where it is critical to protect users' private information. We first extend the definitions of Joint DP (JDP) and Local DP (LDP) to two-player zero-sum episodic Markov Games, where both definitions ensure trajectory-wise privacy protection. Then we design a provably efficient algorithm based on optimistic Nash value iteration and privatization of Bernstein-type bonuses. The algorithm is able to satisfy JDP and LDP requirements when instantiated with appropriate privacy mechanisms. Furthermore, for both notions of DP, our regret bound generalizes the best known result under the single-agent RL case, while our regret could also reduce to the best known result for multi-agent RL without privacy constraints. To the best of our knowledge, these are the first line of results towards understanding trajectory-wise privacy protection in multi-agent RL.
- [1140] arXiv:2404.07560 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Socially Pertinent Robots in Gerontological HealthcareXavier Alameda-Pineda , Angus Addlesee , Daniel Hernández García , Chris Reinke , Soraya Arias , Federica Arrigoni , Alex Auternaud , Lauriane Blavette , Cigdem Beyan , Luis Gomez Camara , Ohad Cohen , Alessandro Conti , Sébastien Dacunha , Christian Dondrup , Yoav Ellinson , Francesco Ferro , Sharon Gannot , Florian Gras , Nancie Gunson , Radu Horaud , Moreno D'Incà , Imad Kimouche , Séverin Lemaignan , Oliver Lemon , Cyril Liotard , Luca Marchionni , Mordehay Moradi , Tomas Pajdla , Maribel Pino , Michal Polic , Matthieu Py , Ariel Rado , Bin Ren , Elisa Ricci , Anne-Sophie Rigaud , Paolo Rota , Marta Romeo , Nicu Sebe , Weronika Sieińska , Pinchas Tandeitnik , Francesco Tonini , Nicolas Turro , Timothée Wintz , Yanchao YuSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Despite the many recent achievements in developing and deploying social robotics, there are still many underexplored environments and applications for which systematic evaluation of such systems by end-users is necessary. While several robotic platforms have been used in gerontological healthcare, the question of whether or not a social interactive robot with multi-modal conversational capabilities will be useful and accepted in real-life facilities is yet to be answered. This paper is an attempt to partially answer this question, via two waves of experiments with patients and companions in a day-care gerontological facility in Paris with a full-sized humanoid robot endowed with social and conversational interaction capabilities. The software architecture, developed during the H2020 SPRING project, together with the experimental protocol, allowed us to evaluate the acceptability (AES) and usability (SUS) with more than 60 end-users. Overall, the users are receptive to this technology, especially when the robot perception and action skills are robust to environmental clutter and flexible to handle a plethora of different interactions.
- [1141] arXiv:2404.07569 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Can Vehicle Motion Planning Generalize to Realistic Long-tail Scenarios?Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Real-world autonomous driving systems must make safe decisions in the face of rare and diverse traffic scenarios. Current state-of-the-art planners are mostly evaluated on real-world datasets like nuScenes (open-loop) or nuPlan (closed-loop). In particular, nuPlan seems to be an expressive evaluation method since it is based on real-world data and closed-loop, yet it mostly covers basic driving scenarios. This makes it difficult to judge a planner's capabilities to generalize to rarely-seen situations. Therefore, we propose a novel closed-loop benchmark interPlan containing several edge cases and challenging driving scenarios. We assess existing state-of-the-art planners on our benchmark and show that neither rule-based nor learning-based planners can safely navigate the interPlan scenarios.
A recently evolving direction is the usage of foundation models like large language models (LLM) to handle generalization. We evaluate an LLM-only planner and introduce a novel hybrid planner that combines an LLM-based behavior planner with a rule-based motion planner that achieves state-of-the-art performance on our benchmark. - [1142] arXiv:2404.07572 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Fragile Model Watermark for integrity protection: leveraging boundary volatility and sensitive sample-pairingComments: The article has been accepted by IEEE International Conference on Multimedia and Expo 2024Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Neural networks have increasingly influenced people's lives. Ensuring the faithful deployment of neural networks as designed by their model owners is crucial, as they may be susceptible to various malicious or unintentional modifications, such as backdooring and poisoning attacks. Fragile model watermarks aim to prevent unexpected tampering that could lead DNN models to make incorrect decisions. They ensure the detection of any tampering with the model as sensitively as possible.However, prior watermarking methods suffered from inefficient sample generation and insufficient sensitivity, limiting their practical applicability. Our approach employs a sample-pairing technique, placing the model boundaries between pairs of samples, while simultaneously maximizing logits. This ensures that the model's decision results of sensitive samples change as much as possible and the Top-1 labels easily alter regardless of the direction it moves.
- [1143] arXiv:2404.07575 (cross-list from cs.SD) [ pdf , ps , other ]
-
Title: An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced DistributionComments: Accepted to NAACL 2024 FindingsSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Automated speaking assessment (ASA) typically involves automatic speech recognition (ASR) and hand-crafted feature extraction from the ASR transcript of a learner's speech. Recently, self-supervised learning (SSL) has shown stellar performance compared to traditional methods. However, SSL-based ASA systems are faced with at least three data-related challenges: limited annotated data, uneven distribution of learner proficiency levels and non-uniform score intervals between different CEFR proficiency levels. To address these challenges, we explore the use of two novel modeling strategies: metric-based classification and loss reweighting, leveraging distinct SSL-based embedding features. Extensive experimental results on the ICNALE benchmark dataset suggest that our approach can outperform existing strong baselines by a sizable margin, achieving a significant improvement of more than 10% in CEFR prediction accuracy.
- [1144] arXiv:2404.07605 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Contrastive-Based Deep Embeddings for Label Noise-Resilient Histopathology Image ClassificationLucas Dedieu , Nicolas Nerrienet , Adrien Nivaggioli , Clara Simmat , Marceau Clavel , Arnaud Gauthier , Stéphane Sockeel , Rémy PeyretComments: 16 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in deep learning have proven highly effective in medical image classification, notably within histopathology. However, noisy labels represent a critical challenge in histopathology image classification, where accurate annotations are vital for training robust deep learning models. Indeed, deep neural networks can easily overfit label noise, leading to severe degradations in model performance. While numerous public pathology foundation models have emerged recently, none have evaluated their resilience to label noise. Through thorough empirical analyses across multiple datasets, we exhibit the label noise resilience property of embeddings extracted from foundation models trained in a self-supervised contrastive manner. We demonstrate that training with such embeddings substantially enhances label noise robustness when compared to non-contrastive-based ones as well as commonly used noise-resilient methods. Our results unequivocally underline the superiority of contrastive learning in effectively mitigating the label noise challenge. Code is publicly available at this https URL .
- [1145] arXiv:2404.07611 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: NoticIA: A Clickbait Article Summarization Dataset in SpanishComments: Under review in the journal Procesamiento del Lenguaje NaturalSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We present NoticIA, a dataset consisting of 850 Spanish news articles featuring prominent clickbait headlines, each paired with high-quality, single-sentence generative summarizations written by humans. This task demands advanced text understanding and summarization abilities, challenging the models' capacity to infer and connect diverse pieces of information to meet the user's informational needs generated by the clickbait headline. We evaluate the Spanish text comprehension capabilities of a wide range of state-of-the-art large language models. Additionally, we use the dataset to train ClickbaitFighter, a task-specific model that achieves near-human performance in this task.
- [1146] arXiv:2404.07613 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical DomainIker García-Ferrero , Rodrigo Agerri , Aitziber Atutxa Salazar , Elena Cabrio , Iker de la Iglesia , Alberto Lavelli , Bernardo Magnini , Benjamin Molinet , Johana Ramirez-Romero , German Rigau , Jose Maria Villa-Gonzalez , Serena Villata , Andrea ZaninelloComments: LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Research on language technology for the development of medical applications is currently a hot topic in Natural Language Understanding and Generation. Thus, a number of large language models (LLMs) have recently been adapted to the medical domain, so that they can be used as a tool for mediating in human-AI interaction. While these LLMs display competitive performance on automated medical texts benchmarks, they have been pre-trained and evaluated with a focus on a single language (English mostly). This is particularly true of text-to-text models, which typically require large amounts of domain-specific pre-training data, often not easily accessible for many languages. In this paper, we address these shortcomings by compiling, to the best of our knowledge, the largest multilingual corpus for the medical domain in four languages, namely English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain. Additionally, we present two new evaluation benchmarks for all four languages with the aim of facilitating multilingual research in this domain. A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models for the Spanish, French, and Italian benchmarks, while being competitive with current state-of-the-art LLMs in English.
- [1147] arXiv:2404.07662 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: PINNACLE: PINN Adaptive ColLocation and Experimental points selectionComments: Accepted to 12th International Conference on Learning Representations (ICLR 2024), 36 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Abstract: Physics-Informed Neural Networks (PINNs), which incorporate PDEs as soft constraints, train with a composite loss function that contains multiple training point types: different types of collocation points chosen during training to enforce each PDE and initial/boundary conditions, and experimental points which are usually costly to obtain via experiments or simulations. Training PINNs using this loss function is challenging as it typically requires selecting large numbers of points of different types, each with different training dynamics. Unlike past works that focused on the selection of either collocation or experimental points, this work introduces PINN Adaptive ColLocation and Experimental points selection (PINNACLE), the first algorithm that jointly optimizes the selection of all training point types, while automatically adjusting the proportion of collocation point types as training progresses. PINNACLE uses information on the interaction among training point types, which had not been considered before, based on an analysis of PINN training dynamics via the Neural Tangent Kernel (NTK). We theoretically show that the criterion used by PINNACLE is related to the PINN generalization error, and empirically demonstrate that PINNACLE is able to outperform existing point selection methods for forward, inverse, and transfer learning problems.
- [1148] arXiv:2404.07663 (cross-list from cs.DB) [ pdf , ps , html , other ]
-
Title: Interactive Ontology Matching with Cost-Efficient LearningSubjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The creation of high-quality ontologies is crucial for data integration and knowledge-based reasoning, specifically in the context of the rising data economy. However, automatic ontology matchers are often bound to the heuristics they are based on, leaving many matches unidentified. Interactive ontology matching systems involving human experts have been introduced, but they do not solve the fundamental issue of flexibly finding additional matches outside the scope of the implemented heuristics, even though this is highly demanded in industrial settings. Active machine learning methods appear to be a promising path towards a flexible interactive ontology matcher. However, off-the-shelf active learning mechanisms suffer from low query efficiency due to extreme class imbalance, resulting in a last-mile problem where high human effort is required to identify the remaining matches.
To address the last-mile problem, this work introduces DualLoop, an active learning method tailored to ontology matching. DualLoop offers three main contributions: (1) an ensemble of tunable heuristic matchers, (2) a short-term learner with a novel query strategy adapted to highly imbalanced data, and (3) long-term learners to explore potential matches by creating and tuning new heuristics. We evaluated DualLoop on three datasets of varying sizes and domains. Compared to existing active learning methods, we consistently achieved better F1 scores and recall, reducing the expected query cost spent on finding 90% of all matches by over 50%. Compared to traditional interactive ontology matchers, we are able to find additional, last-mile matches. Finally, we detail the successful deployment of our approach within an actual product and report its operational performance results within the Architecture, Engineering, and Construction (AEC) industry sector, showcasing its practical value and efficiency. - [1149] arXiv:2404.07664 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Finding Dino: A plug-and-play framework for unsupervised detection of out-of-distribution objects using prototypesPoulami Sinhamahapatra , Franziska Schwaiger , Shirsha Bose , Huiyu Wang , Karsten Roscher , Stephan GuennemannSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Detecting and localising unknown or Out-of-distribution (OOD) objects in any scene can be a challenging task in vision. Particularly, in safety-critical cases involving autonomous systems like automated vehicles or trains. Supervised anomaly segmentation or open-world object detection models depend on training on exhaustively annotated datasets for every domain and still struggle in distinguishing between background and OOD objects. In this work, we present a plug-and-play generalised framework - PRototype-based zero-shot OOD detection Without Labels (PROWL). It is an inference-based method that does not require training on the domain dataset and relies on extracting relevant features from self-supervised pre-trained models. PROWL can be easily adapted to detect OOD objects in any operational design domain by specifying a list of known classes from this domain. PROWL, as an unsupervised method, outperforms other supervised methods trained without auxiliary OOD data on the RoadAnomaly and RoadObstacle datasets provided in SegmentMeIfYouCan (SMIYC) benchmark. We also demonstrate its suitability for other domains such as rail and maritime scenes.
- [1150] arXiv:2404.07676 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Model-based Cleaning of the QUILT-1M Pathology Dataset for Text-Conditional Image SynthesisComments: 4 pages (short paper)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The QUILT-1M dataset is the first openly available dataset containing images harvested from various online sources. While it provides a huge data variety, the image quality and composition is highly heterogeneous, impacting its utility for text-conditional image synthesis. We propose an automatic pipeline that provides predictions of the most common impurities within the images, e.g., visibility of narrators, desktop environment and pathology software, or text within the image. Additionally, we propose to use semantic alignment filtering of the image-text pairs. Our findings demonstrate that by rigorously filtering the dataset, there is a substantial enhancement of image fidelity in text-to-image tasks.
- [1151] arXiv:2404.07677 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ODA: Observation-Driven Agent for integrating LLMs and Knowledge GraphsComments: LLM+KGSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The integration of Large Language Models (LLMs) and knowledge graphs (KGs) has achieved remarkable success in various natural language processing tasks. However, existing methodologies that integrate LLMs and KGs often navigate the task-solving process solely based on the LLM's analysis of the question, overlooking the rich cognitive potential inherent in the vast knowledge encapsulated in KGs. To address this, we introduce Observation-Driven Agent (ODA), a novel AI agent framework tailored for tasks involving KGs. ODA incorporates KG reasoning abilities via global observation that enhances reasoning capabilities through a cyclical paradigm of observation, action, and reflection. Confronting the exponential explosion of knowledge during observation, we innovatively design a recursive observation mechanism. Subsequently, we integrate the observed knowledge into the action and reflection modules. Through extensive experiments, ODA demonstrates state-of-the-art performance on several datasets, notably achieving accuracy improvements of 12.87% and 8.9%.
- [1152] arXiv:2404.07685 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Run-time Monitoring of 3D Object Detection in Automated Driving Systems Using Early Layer Neural Activation PatternsComments: Accepted by CVPR 2024 Workshop on Safe Autonomy for All Domains (SAIAD)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Monitoring the integrity of object detection for errors within the perception module of automated driving systems (ADS) is paramount for ensuring safety. Despite recent advancements in deep neural network (DNN)-based object detectors, their susceptibility to detection errors, particularly in the less-explored realm of 3D object detection, remains a significant concern. State-of-the-art integrity monitoring (also known as introspection) mechanisms in 2D object detection mainly utilise the activation patterns in the final layer of the DNN-based detector's backbone. However, that may not sufficiently address the complexities and sparsity of data in 3D object detection. To this end, we conduct, in this article, an extensive investigation into the effects of activation patterns extracted from various layers of the backbone network for introspecting the operation of 3D object detectors. Through a comparative analysis using Kitti and NuScenes datasets with PointPillars and CenterPoint detectors, we demonstrate that using earlier layers' activation patterns enhances the error detection performance of the integrity monitoring system, yet increases computational complexity. To address the real-time operation requirements in ADS, we also introduce a novel introspection method that combines activation patterns from multiple layers of the detector's backbone and report its performance.
- [1153] arXiv:2404.07686 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Depth Estimation using Weighted-loss and Transfer LearningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Depth estimation from 2D images is a common computer vision task that has applications in many fields including autonomous vehicles, scene understanding and robotics. The accuracy of a supervised depth estimation method mainly relies on the chosen loss function, the model architecture, quality of data and performance metrics. In this study, we propose a simplified and adaptable approach to improve depth estimation accuracy using transfer learning and an optimized loss function. The optimized loss function is a combination of weighted losses to which enhance robustness and generalization: Mean Absolute Error (MAE), Edge Loss and Structural Similarity Index (SSIM). We use a grid search and a random search method to find optimized weights for the losses, which leads to an improved model. We explore multiple encoder-decoder-based models including DenseNet121, DenseNet169, DenseNet201, and EfficientNet for the supervised depth estimation model on NYU Depth Dataset v2. We observe that the EfficientNet model, pre-trained on ImageNet for classification when used as an encoder, with a simple upsampling decoder, gives the best results in terms of RSME, REL and log10: 0.386, 0.113 and 0.049, respectively. We also perform a qualitative analysis which illustrates that our model produces depth maps that closely resemble ground truth, even in cases where the ground truth is flawed. The results indicate significant improvements in accuracy and robustness, with EfficientNet being the most successful architecture.
- [1154] arXiv:2404.07698 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Point Cloud Geometry Scalable Coding with a Quality-Conditioned Latents Probability EstimatorComments: Submitted at ICIP 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The widespread usage of point clouds (PC) for immersive visual applications has resulted in the use of very heterogeneous receiving conditions and devices, notably in terms of network, hardware, and display capabilities. In this scenario, quality scalability, i.e., the ability to reconstruct a signal at different qualities by progressively decoding a single bitstream, is a major requirement that has yet to be conveniently addressed, notably in most learning-based PC coding solutions. This paper proposes a quality scalability scheme, named Scalable Quality Hyperprior (SQH), adaptable to learning-based static point cloud geometry codecs, which uses a Quality-conditioned Latents Probability Estimator (QuLPE) to decode a high-quality version of a PC learning-based representation, based on an available lower quality base layer. SQH is integrated in the future JPEG PC coding standard, allowing to create a layered bitstream that can be used to progressively decode the PC geometry with increasing quality and fidelity. Experimental results show that SQH offers the quality scalability feature with very limited or no compression performance penalty at all when compared with the corresponding non-scalable solution, thus preserving the significant compression gains over other state-of-the-art PC codecs.
- [1155] arXiv:2404.07724 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Abstract: Guidance is a crucial technique for extracting the best performance out of image-generating diffusion models. Traditionally, a constant guidance weight has been applied throughout the sampling chain of an image. We show that guidance is clearly harmful toward the beginning of the chain (high noise levels), largely unnecessary toward the end (low noise levels), and only beneficial in the middle. We thus restrict it to a specific range of noise levels, improving both the inference speed and result quality. This limited guidance interval improves the record FID in ImageNet-512 significantly, from 1.81 to 1.40. We show that it is quantitatively and qualitatively beneficial across different sampler parameters, network architectures, and datasets, including the large-scale setting of Stable Diffusion XL. We thus suggest exposing the guidance interval as a hyperparameter in all diffusion models that use guidance.
- [1156] arXiv:2404.07725 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Unraveling the Dilemma of AI Errors: Exploring the Effectiveness of Human and Machine Explanations for Large Language ModelsSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: The field of eXplainable artificial intelligence (XAI) has produced a plethora of methods (e.g., saliency-maps) to gain insight into artificial intelligence (AI) models, and has exploded with the rise of deep learning (DL). However, human-participant studies question the efficacy of these methods, particularly when the AI output is wrong. In this study, we collected and analyzed 156 human-generated text and saliency-based explanations collected in a question-answering task (N=40) and compared them empirically to state-of-the-art XAI explanations (integrated gradients, conservative LRP, and ChatGPT) in a human-participant study (N=136). Our findings show that participants found human saliency maps to be more helpful in explaining AI answers than machine saliency maps, but performance negatively correlated with trust in the AI model and explanations. This finding hints at the dilemma of AI errors in explanation, where helpful explanations can lead to lower task performance when they support wrong AI predictions.
- [1157] arXiv:2404.07735 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Diffusing in Someone Else's Shoes: Robotic Perspective Taking with DiffusionSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Humanoid robots can benefit from their similarity to the human shape by learning from humans. When humans teach other humans how to perform actions, they often demonstrate the actions and the learning human can try to imitate the demonstration. Being able to mentally transfer from a demonstration seen from a third-person perspective to how it should look from a first-person perspective is fundamental for this ability in humans. As this is a challenging task, it is often simplified for robots by creating a demonstration in the first-person perspective. Creating these demonstrations requires more effort but allows for an easier imitation. We introduce a novel diffusion model aimed at enabling the robot to directly learn from the third-person demonstrations. Our model is capable of learning and generating the first-person perspective from the third-person perspective by translating the size and rotations of objects and the environment between two perspectives. This allows us to utilise the benefits of easy-to-produce third-person demonstrations and easy-to-imitate first-person demonstrations. The model can either represent the first-person perspective in an RGB image or calculate the joint values. Our approach significantly outperforms other image-to-image models in this task.
- [1158] arXiv:2404.07738 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Scientific Research, vital for improving human life, is hindered by its inherent complexity, slow pace, and the need for specialized experts. To enhance its productivity, we propose a ResearchAgent, a large language model-powered research idea writing agent, which automatically generates problems, methods, and experiment designs while iteratively refining them based on scientific literature. Specifically, starting with a core paper as the primary focus to generate ideas, our ResearchAgent is augmented not only with relevant publications through connecting information over an academic graph but also entities retrieved from an entity-centric knowledge store based on their underlying concepts, mined and shared across numerous papers. In addition, mirroring the human approach to iteratively improving ideas with peer discussions, we leverage multiple ReviewingAgents that provide reviews and feedback iteratively. Further, they are instantiated with human preference-aligned large language models whose criteria for evaluation are derived from actual human judgments. We experimentally validate our ResearchAgent on scientific publications across multiple disciplines, showcasing its effectiveness in generating novel, clear, and valid research ideas based on human and model-based evaluation results.
- [1159] arXiv:2404.07751 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Generating consistent PDDL domains with Large Language ModelsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) are capable of transforming natural language domain descriptions into plausibly looking PDDL markup. However, ensuring that actions are consistent within domains still remains a challenging task. In this paper we present a novel concept to significantly improve the quality of LLM-generated PDDL models by performing automated consistency checking during the generation process. Although the proposed consistency checking strategies still can't guarantee absolute correctness of generated models, they can serve as valuable source of feedback reducing the amount of correction efforts expected from a human in the loop. We demonstrate the capabilities of our error detection approach on a number of classical and custom planning domains (logistics, gripper, tyreworld, household, pizza).
- [1160] arXiv:2404.07754 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Generating Synthetic Satellite Imagery With Deep-Learning Text-to-Image Models -- Technical Challenges and Implications for Monitoring and VerificationComments: this https URLJournal-ref: Presented at the Annual Meeting of the Institute of Nuclear Materials Management (INMM), Vienna, 2023Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: Novel deep-learning (DL) architectures have reached a level where they can generate digital media, including photorealistic images, that are difficult to distinguish from real data. These technologies have already been used to generate training data for Machine Learning (ML) models, and large text-to-image models like DALL-E 2, Imagen, and Stable Diffusion are achieving remarkable results in realistic high-resolution image generation. Given these developments, issues of data authentication in monitoring and verification deserve a careful and systematic analysis: How realistic are synthetic images? How easily can they be generated? How useful are they for ML researchers, and what is their potential for Open Science? In this work, we use novel DL models to explore how synthetic satellite images can be created using conditioning mechanisms. We investigate the challenges of synthetic satellite image generation and evaluate the results based on authenticity and state-of-the-art metrics. Furthermore, we investigate how synthetic data can alleviate the lack of data in the context of ML methods for remote-sensing. Finally we discuss implications of synthetic satellite imagery in the context of monitoring and verification.
- [1161] arXiv:2404.07765 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: AnnoCTR: A Dataset for Detecting and Linking Entities, Tactics, and Techniques in Cyber Threat ReportsLukas Lange , Marc Müller , Ghazaleh Haratinezhad Torbati , Dragan Milchevski , Patrick Grau , Subhash Pujari , Annemarie FriedrichComments: Accepted at LREC-COLING 2024. Corpus available at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract: Monitoring the threat landscape to be aware of actual or potential attacks is of utmost importance to cybersecurity professionals. Information about cyber threats is typically distributed using natural language reports. Natural language processing can help with managing this large amount of unstructured information, yet to date, the topic has received little attention. With this paper, we present AnnoCTR, a new CC-BY-SA-licensed dataset of cyber threat reports. The reports have been annotated by a domain expert with named entities, temporal expressions, and cybersecurity-specific concepts including implicitly mentioned techniques and tactics. Entities and concepts are linked to Wikipedia and the MITRE ATT&CK knowledge base, the most widely-used taxonomy for classifying types of attacks. Prior datasets linking to MITRE ATT&CK either provide a single label per document or annotate sentences out-of-context; our dataset annotates entire documents in a much finer-grained way. In an experimental study, we model the annotations of our dataset using state-of-the-art neural models. In our few-shot scenario, we find that for identifying the MITRE ATT&CK concepts that are mentioned explicitly or implicitly in a text, concept descriptions from MITRE ATT&CK are an effective source for training data augmentation.
- [1162] arXiv:2404.07775 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Discourse-Aware In-Context Learning for Temporal Expression NormalizationComments: Accepted at NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Temporal expression (TE) normalization is a well-studied problem. However, the predominately used rule-based systems are highly restricted to specific settings, and upcoming machine learning approaches suffer from a lack of labeled data. In this work, we explore the feasibility of proprietary and open-source large language models (LLMs) for TE normalization using in-context learning to inject task, document, and example information into the model. We explore various sample selection strategies to retrieve the most relevant set of examples. By using a window-based prompt design approach, we can perform TE normalization across sentences, while leveraging the LLM knowledge without training the model. Our experiments show competitive results to models designed for this task. In particular, our method achieves large performance improvements for non-standard settings by dynamically including relevant examples during inference.
- [1163] arXiv:2404.07788 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AUG: A New Dataset and An Efficient Model for Aerial Image Urban Scene Graph GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Scene graph generation (SGG) aims to understand the visual objects and their semantic relationships from one given image. Until now, lots of SGG datasets with the eyelevel view are released but the SGG dataset with the overhead view is scarcely studied. By contrast to the object occlusion problem in the eyelevel view, which impedes the SGG, the overhead view provides a new perspective that helps to promote the SGG by providing a clear perception of the spatial relationships of objects in the ground scene. To fill in the gap of the overhead view dataset, this paper constructs and releases an aerial image urban scene graph generation (AUG) dataset. Images from the AUG dataset are captured with the low-attitude overhead view. In the AUG dataset, 25,594 objects, 16,970 relationships, and 27,175 attributes are manually annotated. To avoid the local context being overwhelmed in the complex aerial urban scene, this paper proposes one new locality-preserving graph convolutional network (LPG). Different from the traditional graph convolutional network, which has the natural advantage of capturing the global context for SGG, the convolutional layer in the LPG integrates the non-destructive initial features of the objects with dynamically updated neighborhood information to preserve the local context under the premise of mining the global context. To address the problem that there exists an extra-large number of potential object relationship pairs but only a small part of them is meaningful in AUG, we propose the adaptive bounding box scaling factor for potential relationship detection (ABS-PRD) to intelligently prune the meaningless relationship pairs. Extensive experiments on the AUG dataset show that our LPG can significantly outperform the state-of-the-art methods and the effectiveness of the proposed locality-preserving strategy.
- [1164] arXiv:2404.07815 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Post-Hoc Reversal: Are We Selecting Models Prematurely?Comments: 9 pages + references + appendix, 7 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Trained models are often composed with post-hoc transforms such as temperature scaling (TS), ensembling and stochastic weight averaging (SWA) to improve performance, robustness, uncertainty estimation, etc. However, such transforms are typically applied only after the base models have already been finalized by standard means. In this paper, we challenge this practice with an extensive empirical study. In particular, we demonstrate a phenomenon that we call post-hoc reversal, where performance trends are reversed after applying these post-hoc transforms. This phenomenon is especially prominent in high-noise settings. For example, while base models overfit badly early in training, both conventional ensembling and SWA favor base models trained for more epochs. Post-hoc reversal can also suppress the appearance of double descent and mitigate mismatches between test loss and test error seen in base models. Based on our findings, we propose post-hoc selection, a simple technique whereby post-hoc metrics inform model development decisions such as early stopping, checkpointing, and broader hyperparameter choices. Our experimental analyses span real-world vision, language, tabular and graph datasets from domains like satellite imaging, language modeling, census prediction and social network analysis. On an LLM instruction tuning dataset, post-hoc selection results in > 1.5x MMLU improvement compared to naive selection. Code is available at this https URL .
- [1165] arXiv:2404.07817 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Calibration of Continual Learning ModelsComments: Accepted at CLVISION workshop, CVPR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Continual Learning (CL) focuses on maximizing the predictive performance of a model across a non-stationary stream of data. Unfortunately, CL models tend to forget previous knowledge, thus often underperforming when compared with an offline model trained jointly on the entire data stream. Given that any CL model will eventually make mistakes, it is of crucial importance to build calibrated CL models: models that can reliably tell their confidence when making a prediction. Model calibration is an active research topic in machine learning, yet to be properly investigated in CL. We provide the first empirical study of the behavior of calibration approaches in CL, showing that CL strategies do not inherently learn calibrated models. To mitigate this issue, we design a continual calibration approach that improves the performance of post-processing calibration methods over a wide range of different benchmarks and CL strategies. CL does not necessarily need perfect predictive models, but rather it can benefit from reliable predictive models. We believe our study on continual calibration represents a first step towards this direction.
- [1166] arXiv:2404.07826 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: On the Sample Efficiency of Abstractions and Potential-Based Reward Shaping in Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The use of Potential Based Reward Shaping (PBRS) has shown great promise in the ongoing research effort to tackle sample inefficiency in Reinforcement Learning (RL). However, the choice of the potential function is critical for this technique to be effective. Additionally, RL techniques are usually constrained to use a finite horizon for computational limitations. This introduces a bias when using PBRS, thus adding an additional layer of complexity. In this paper, we leverage abstractions to automatically produce a "good" potential function. We analyse the bias induced by finite horizons in the context of PBRS producing novel insights. Finally, to asses sample efficiency and performance impact, we evaluate our approach on four environments including a goal-oriented navigation task and three Arcade Learning Environments (ALE) games demonstrating that we can reach the same level of performance as CNN-based solutions with a simple fully-connected network.
- [1167] arXiv:2404.07839 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: RecurrentGemma: Moving Past Transformers for Efficient Open Language ModelsAleksandar Botev , Soham De , Samuel L Smith , Anushan Fernando , George-Cristian Muraru , Ruba Haroun , Leonard Berrada , Razvan Pascanu , Pier Giuseppe Sessa , Robert Dadashi , Léonard Hussenot , Johan Ferret , Sertan Girgin , Olivier Bachem , Alek Andreev , Kathleen Kenealy , Thomas Mesnard , Cassidy Hardin , Surya Bhupatiraju , Shreya Pathak , Laurent Sifre , Morgane Rivière , Mihir Sanjay Kale , Juliette Love , Pouya Tafti , Armand Joulin , Noah Fiedel , Evan Senter , Yutian Chen , Srivatsan Srinivasan , Guillaume Desjardins , David Budden , Arnaud Doucet , Sharad Vikram , Adam Paszke , Trevor Gale , Sebastian Borgeaud , Charlie Chen , Andy Brock , Antonia Paterson , Jenny Brennan , Meg Risdal , Raj Gundluru , Nesh Devanathan , Paul Mooney , Nilay Chauhan , Phil Culliton , Luiz GUStavo Martins , Elisa Bandy , David Huntsperger , Glenn Cameron , Arthur Zucker , Tris Warkentin , Ludovic Peran , Minh Giang , Zoubin Ghahramani , Clément Farabet , Koray Kavukcuoglu , Demis Hassabis , Raia Hadsell , Yee Whye Teh , Nando de FrietasSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: We introduce RecurrentGemma, an open language model which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide a pre-trained model with 2B non-embedding parameters, and an instruction tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens.
- [1168] arXiv:2404.07850 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MindBridge: A Cross-Subject Brain Decoding FrameworkComments: CVPR 2024 highlight. Code is available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Brain decoding, a pivotal field in neuroscience, aims to reconstruct stimuli from acquired brain signals, primarily utilizing functional magnetic resonance imaging (fMRI). Currently, brain decoding is confined to a per-subject-per-model paradigm, limiting its applicability to the same individual for whom the decoding model is trained. This constraint stems from three key challenges: 1) the inherent variability in input dimensions across subjects due to differences in brain size; 2) the unique intrinsic neural patterns, influencing how different individuals perceive and process sensory information; 3) limited data availability for new subjects in real-world scenarios hampers the performance of decoding models. In this paper, we present a novel approach, MindBridge, that achieves cross-subject brain decoding by employing only one model. Our proposed framework establishes a generic paradigm capable of addressing these challenges by introducing biological-inspired aggregation function and novel cyclic fMRI reconstruction mechanism for subject-invariant representation learning. Notably, by cycle reconstruction of fMRI, MindBridge can enable novel fMRI synthesis, which also can serve as pseudo data augmentation. Within the framework, we also devise a novel reset-tuning method for adapting a pretrained model to a new subject. Experimental results demonstrate MindBridge's ability to reconstruct images for multiple subjects, which is competitive with dedicated subject-specific models. Furthermore, with limited data for a new subject, we achieve a high level of decoding accuracy, surpassing that of subject-specific models. This advancement in cross-subject brain decoding suggests promising directions for wider applications in neuroscience and indicates potential for more efficient utilization of limited fMRI data in real-world scenarios. Project page: this https URL
- [1169] arXiv:2404.07851 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Guiding Large Language Models to Post-Edit Machine Translation with Error AnnotationsComments: 21 pages, 8 figuresJournal-ref: NAACL 2024 FindingsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Machine Translation (MT) remains one of the last NLP tasks where large language models (LLMs) have not yet replaced dedicated supervised systems. This work exploits the complementary strengths of LLMs and supervised MT by guiding LLMs to automatically post-edit MT with external feedback on its quality, derived from Multidimensional Quality Metric (MQM) annotations. Working with LLaMA-2 models, we consider prompting strategies varying the nature of feedback provided and then fine-tune the LLM to improve its ability to exploit the provided guidance. Through experiments on Chinese-English, English-German, and English-Russian MQM data, we demonstrate that prompting LLMs to post-edit MT improves TER, BLEU and COMET scores, although the benefits of fine-grained feedback are not clear. Fine-tuning helps integrate fine-grained feedback more effectively and further improves translation quality based on both automatic and human evaluation.
- [1170] arXiv:2404.07883 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Apprentice Tutor Builder: A Platform For Users to Create and Personalize Intelligent TutorsSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Intelligent tutoring systems (ITS) are effective for improving students' learning outcomes. However, their development is often complex, time-consuming, and requires specialized programming and tutor design knowledge, thus hindering their widespread application and personalization. We present the Apprentice Tutor Builder (ATB) , a platform that simplifies tutor creation and personalization. Instructors can utilize ATB's drag-and-drop tool to build tutor interfaces. Instructors can then interactively train the tutors' underlying AI agent to produce expert models that can solve problems. Training is achieved via using multiple interaction modalities including demonstrations, feedback, and user labels. We conducted a user study with 14 instructors to evaluate the effectiveness of ATB's design with end users. We found that users enjoyed the flexibility of the interface builder and ease and speed of agent teaching, but often desired additional time-saving features. With these insights, we identified a set of design recommendations for our platform and others that utilize interactive AI agents for tutor creation and customization.
- [1171] arXiv:2404.07900 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: High-Dimension Human Value Representation in Large Language ModelsSamuel Cahyawijaya , Delong Chen , Yejin Bang , Leila Khalatbari , Bryan Wilie , Ziwei Ji , Etsuko Ishii , Pascale FungSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The widespread application of Large Language Models (LLMs) across various tasks and fields has necessitated the alignment of these models with human values and preferences. Given various approaches of human value alignment, ranging from Reinforcement Learning with Human Feedback (RLHF), to constitutional learning, etc. there is an urgent need to understand the scope and nature of human values injected into these models before their release. There is also a need for model alignment without a costly large scale human annotation effort. We propose UniVaR, a high-dimensional representation of human value distributions in LLMs, orthogonal to model architecture and training data. Trained from the value-relevant output of eight multilingual LLMs and tested on the output from four multilingual LLMs, namely LlaMA2, ChatGPT, JAIS and Yi, we show that UniVaR is a powerful tool to compare the distribution of human values embedded in different LLMs with different langauge sources. Through UniVaR, we explore how different LLMs prioritize various values in different languages and cultures, shedding light on the complex interplay between human values and language modeling.
- [1172] arXiv:2404.07919 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Low-rank Adaptation for Spatio-Temporal ForecastingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Spatio-temporal forecasting is crucial in real-world dynamic systems, predicting future changes using historical data from diverse locations. Existing methods often prioritize the development of intricate neural networks to capture the complex dependencies of the data, yet their accuracy fails to show sustained improvement. Besides, these methods also overlook node heterogeneity, hindering customized prediction modules from handling diverse regional nodes effectively. In this paper, our goal is not to propose a new model but to present a novel low-rank adaptation framework as an off-the-shelf plugin for existing spatial-temporal prediction models, termed ST-LoRA, which alleviates the aforementioned problems through node-level adjustments. Specifically, we first tailor a node adaptive low-rank layer comprising multiple trainable low-rank matrices. Additionally, we devise a multi-layer residual fusion stacking module, injecting the low-rank adapters into predictor modules of various models. Across six real-world traffic datasets and six different types of spatio-temporal prediction models, our approach minimally increases the parameters and training time of the original models by less than 4%, still achieving consistent and sustained performance enhancement.
- [1173] arXiv:2404.07926 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Leveraging Large Language Models (LLMs) to Support Collaborative Human-AI Online Risk Data AnnotationComments: This paper has been peer-reviewed and presented at the "CHI 2024 Workshop on LLMs as Research Tools: Applications and Evaluations in HCI Data Work, May 12, 2024, Honolulu, HI, USA."Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: In this position paper, we discuss the potential for leveraging LLMs as interactive research tools to facilitate collaboration between human coders and AI to effectively annotate online risk data at scale. Collaborative human-AI labeling is a promising approach to annotating large-scale and complex data for various tasks. Yet, tools and methods to support effective human-AI collaboration for data annotation are under-studied. This gap is pertinent because co-labeling tasks need to support a two-way interactive discussion that can add nuance and context, particularly in the context of online risk, which is highly subjective and contextualized. Therefore, we provide some of the early benefits and challenges of using LLMs-based tools for risk annotation and suggest future directions for the HCI research community to leverage LLMs as research tools to facilitate human-AI collaboration in contextualized online data annotation. Our research interests align very well with the purposes of the LLMs as Research Tools workshop to identify ongoing applications and challenges of using LLMs to work with data in HCI research. We anticipate learning valuable insights from organizers and participants into how LLMs can help reshape the HCI community's methods for working with data.
- [1174] arXiv:2404.07930 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Parameter Hierarchical Optimization for Visible-Infrared Person Re-IdentificationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Visible-infrared person re-identification (VI-reID) aims at matching cross-modality pedestrian images captured by disjoint visible or infrared cameras. Existing methods alleviate the cross-modality discrepancies via designing different kinds of network architectures. Different from available methods, in this paper, we propose a novel parameter optimizing paradigm, parameter hierarchical optimization (PHO) method, for the task of VI-ReID. It allows part of parameters to be directly optimized without any training, which narrows the search space of parameters and makes the whole network more easier to be trained. Specifically, we first divide the parameters into different types, and then introduce a self-adaptive alignment strategy (SAS) to automatically align the visible and infrared images through transformation. Considering that features in different dimension have varying importance, we develop an auto-weighted alignment learning (AAL) module that can automatically weight features according to their importance. Importantly, in the alignment process of SAS and AAL, all the parameters are immediately optimized with optimization principles rather than training the whole network, which yields a better parameter training manner. Furthermore, we establish the cross-modality consistent learning (CCL) loss to extract discriminative person representations with translation consistency. We provide both theoretical justification and empirical evidence that our proposed PHO method outperform existing VI-reID approaches.
- [1175] arXiv:2404.07941 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: SiGNN: A Spike-induced Graph Neural Network for Dynamic Graph Representation LearningSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In the domain of dynamic graph representation learning (DGRL), the efficient and comprehensive capture of temporal evolution within real-world networks is crucial. Spiking Neural Networks (SNNs), known as their temporal dynamics and low-power characteristic, offer an efficient solution for temporal processing in DGRL task. However, owing to the spike-based information encoding mechanism of SNNs, existing DGRL methods employed SNNs face limitations in their representational capacity. Given this issue, we propose a novel framework named Spike-induced Graph Neural Network (SiGNN) for learning enhanced spatialtemporal representations on dynamic graphs. In detail, a harmonious integration of SNNs and GNNs is achieved through an innovative Temporal Activation (TA) mechanism. Benefiting from the TA mechanism, SiGNN not only effectively exploits the temporal dynamics of SNNs but also adeptly circumvents the representational constraints imposed by the binary nature of spikes. Furthermore, leveraging the inherent adaptability of SNNs, we explore an in-depth analysis of the evolutionary patterns within dynamic graphs across multiple time granularities. This approach facilitates the acquisition of a multiscale temporal node representation.Extensive experiments on various real-world dynamic graph datasets demonstrate the superior performance of SiGNN in the node classification task.
- [1176] arXiv:2404.07942 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: On Unified Prompt Tuning for Request Quality Assurance in Public Code ReviewComments: The 29th International Conference on Database Systems for Advanced Applications (DASFAA)Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Public Code Review (PCR) can be implemented through a Software Question Answering (SQA) community, which facilitates high knowledge dissemination. Current methods mainly focus on the reviewer's perspective, including finding a capable reviewer, predicting comment quality, and recommending/generating review comments. Our intuition is that satisfying review necessity requests can increase their visibility, which in turn is a prerequisite for better review responses. To this end, we propose a unified framework called UniPCR to complete developer-based request quality assurance (i.e., predicting request necessity and recommending tags subtask) under a Masked Language Model (MLM). Specifically, we reformulate both subtasks via 1) text prompt tuning, which converts two subtasks into MLM by constructing prompt templates using hard prompt; 2) code prefix tuning, which optimizes a small segment of generated continuous vectors as the prefix of the code representation using soft prompt. Experimental results on the Public Code Review dataset for the time span 2011-2022 demonstrate that our UniPCR framework adapts to the two subtasks and outperforms comparable accuracy-based results with state-of-the-art methods for request quality assurance. These conclusions highlight the effectiveness of our unified framework from the developer's perspective in public code review.
- [1177] arXiv:2404.07946 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Faster Training of Diffusion Models: An Inspiration of A Consistency PhenomenonSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Diffusion models (DMs) are a powerful generative framework that have attracted significant attention in recent years. However, the high computational cost of training DMs limits their practical applications. In this paper, we start with a consistency phenomenon of DMs: we observe that DMs with different initializations or even different architectures can produce very similar outputs given the same noise inputs, which is rare in other generative models. We attribute this phenomenon to two factors: (1) the learning difficulty of DMs is lower when the noise-prediction diffusion model approaches the upper bound of the timestep (the input becomes pure noise), where the structural information of the output is usually generated; and (2) the loss landscape of DMs is highly smooth, which implies that the model tends to converge to similar local minima and exhibit similar behavior patterns. This finding not only reveals the stability of DMs, but also inspires us to devise two strategies to accelerate the training of DMs. First, we propose a curriculum learning based timestep schedule, which leverages the noise rate as an explicit indicator of the learning difficulty and gradually reduces the training frequency of easier timesteps, thus improving the training efficiency. Second, we propose a momentum decay strategy, which reduces the momentum coefficient during the optimization process, as the large momentum may hinder the convergence speed and cause oscillations due to the smoothness of the loss landscape. We demonstrate the effectiveness of our proposed strategies on various models and show that they can significantly reduce the training time and improve the quality of the generated images.
- [1178] arXiv:2404.07948 (cross-list from cs.SE) [ pdf , ps , other ]
-
Title: Usability and Performance Analysis of Embedded Development Environment for On-device LearningEnzo Scaffi (DYNAMID), Antoine Bonneau (DYNAMID, EE), Frédéric Le Mouël (DYNAMID), Fabien Mieyeville (INL, EE)Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This research empirically examines embedded development tools viable for on-device TinyML implementation. The research evaluates various development tools with various abstraction levels on resource-constrained IoT devices, from basic hardware manipulation to deployment of minimalistic ML training. The analysis encompasses memory usage, energy consumption, and performance metrics during model training and inference and usability of the different solutions. Arduino Framework offers ease of implementation but with increased energy consumption compared to the native option, while RIOT OS exhibits efficient energy consumption despite higher memory utilization with equivalent ease of use. The absence of certain critical functionalities like DVFS directly integrated into the OS highlights limitations for fine hardware control.
- [1179] arXiv:2404.07950 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Reinforcement Learning with Generalizable Gaussian SplattingComments: 7 pages,2 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: An excellent representation is crucial for reinforcement learning (RL) performance, especially in vision-based reinforcement learning tasks. The quality of the environment representation directly influences the achievement of the learning task. Previous vision-based RL typically uses explicit or implicit ways to represent environments, such as images, points, voxels, and neural radiance fields. However, these representations contain several drawbacks. They cannot either describe complex local geometries or generalize well to unseen scenes, or require precise foreground masks. Moreover, these implicit neural representations are akin to a ``black box", significantly hindering interpretability. 3D Gaussian Splatting (3DGS), with its explicit scene representation and differentiable rendering nature, is considered a revolutionary change for reconstruction and representation methods. In this paper, we propose a novel Generalizable Gaussian Splatting framework to be the representation of RL tasks, called GSRL. Through validation in the RoboMimic environment, our method achieves better results than other baselines in multiple tasks, improving the performance by 10%, 44%, and 15% compared with baselines on the hardest task. This work is the first attempt to leverage generalizable 3DGS as a representation for RL.
- [1180] arXiv:2404.07956 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Lyapunov-stable Neural Control for State and Output Feedback: A Novel Formulation for Efficient Synthesis and VerificationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY); Optimization and Control (math.OC)
Abstract: Learning-based neural network (NN) control policies have shown impressive empirical performance in a wide range of tasks in robotics and control. However, formal (Lyapunov) stability guarantees over the region-of-attraction (ROA) for NN controllers with nonlinear dynamical systems are challenging to obtain, and most existing approaches rely on expensive solvers such as sums-of-squares (SOS), mixed-integer programming (MIP), or satisfiability modulo theories (SMT). In this paper, we demonstrate a new framework for learning NN controllers together with Lyapunov certificates using fast empirical falsification and strategic regularizations. We propose a novel formulation that defines a larger verifiable region-of-attraction (ROA) than shown in the literature, and refines the conventional restrictive constraints on Lyapunov derivatives to focus only on certifiable ROAs. The Lyapunov condition is rigorously verified post-hoc using branch-and-bound with scalable linear bound propagation-based NN verification techniques. The approach is efficient and flexible, and the full training and verification procedure is accelerated on GPUs without relying on expensive solvers for SOS, MIP, nor SMT. The flexibility and efficiency of our framework allow us to demonstrate Lyapunov-stable output feedback control with synthesized NN-based controllers and NN-based observers with formal stability guarantees, for the first time in literature. Source code at this https URL .
- [1181] arXiv:2404.07963 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: EduAgent: Generative Student Agents in LearningSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: Student simulation in online education is important to address dynamic learning behaviors of students with diverse backgrounds. Existing simulation models based on deep learning usually need massive training data, lacking prior knowledge in educational contexts. Large language models (LLMs) may contain such prior knowledge since they are pre-trained from a large corpus. However, because student behaviors are dynamic and multifaceted with individual differences, directly prompting LLMs is not robust nor accurate enough to capture fine-grained interactions among diverse student personas, learning behaviors, and learning outcomes. This work tackles this problem by presenting a newly annotated fine-grained large-scale dataset and proposing EduAgent, a novel generative agent framework incorporating cognitive prior knowledge (i.e., theoretical findings revealed in cognitive science) to guide LLMs to first reason correlations among various behaviors and then make simulations. Our two experiments show that EduAgent could not only mimic and predict learning behaviors of real students but also generate realistic learning behaviors of virtual students without real data.
- [1182] arXiv:2404.07965 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Rho-1: Not All Tokens Are What You NeedZhenghao Lin , Zhibin Gou , Yeyun Gong , Xiao Liu , Yelong Shen , Ruochen Xu , Chen Lin , Yujiu Yang , Jian Jiao , Nan Duan , Weizhu ChenComments: First two authors equal contributionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for language model training". Our initial analysis delves into token-level training dynamics of language model, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that aligned with the desired distribution. This approach involves scoring pretraining tokens using a reference model, and then training the language model with a focused loss on tokens with higher excess loss. When continual pretraining on 15B OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% in 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on MATH dataset, respectively - matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretraining on 80B general tokens, Rho-1 achieves 6.8% average enhancement across 15 diverse tasks, increasing both efficiency and performance of the language model pre-training.
- [1183] arXiv:2404.07968 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: AD-NEv++ : The multi-architecture neuroevolution-based multivariate anomaly detection frameworkSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Anomaly detection tools and methods enable key analytical capabilities in modern cyberphysical and sensor-based systems. Despite the fast-paced development in deep learning architectures for anomaly detection, model optimization for a given dataset is a cumbersome and time-consuming process. Neuroevolution could be an effective and efficient solution to this problem, as a fully automated search method for learning optimal neural networks, supporting both gradient and non-gradient fine tuning. However, existing frameworks incorporating neuroevolution lack of support for new layers and architectures and are typically limited to convolutional and LSTM layers. In this paper we propose AD-NEv++, a three-stage neuroevolution-based method that synergically combines subspace evolution, model evolution, and fine-tuning. Our method overcomes the limitations of existing approaches by optimizing the mutation operator in the neuroevolution process, while supporting a wide spectrum of neural layers, including attention, dense, and graph convolutional layers. Our extensive experimental evaluation was conducted with widely adopted multivariate anomaly detection benchmark datasets, and showed that the models generated by AD-NEv++ outperform well-known deep learning architectures and neuroevolution-based approaches for anomaly detection. Moreover, results show that AD-NEv++ can improve and outperform the state-of-the-art GNN (Graph Neural Networks) model architecture in all anomaly detection benchmarks.
- [1184] arXiv:2404.07969 (cross-list from q-fin.ST) [ pdf , ps , other ]
-
Title: An End-to-End Structure with Novel Position Mechanism and Improved EMD for Stock ForecastingComments: ICONIP 2023Subjects: Statistical Finance (q-fin.ST) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: As a branch of time series forecasting, stock movement forecasting is one of the challenging problems for investors and researchers. Since Transformer was introduced to analyze financial data, many researchers have dedicated themselves to forecasting stock movement using Transformer or attention mechanisms. However, existing research mostly focuses on individual stock information but ignores stock market information and high noise in stock data. In this paper, we propose a novel method using the attention mechanism in which both stock market information and individual stock information are considered. Meanwhile, we propose a novel EMD-based algorithm for reducing short-term noise in stock data. Two randomly selected exchange-traded funds (ETFs) spanning over ten years from US stock markets are used to demonstrate the superior performance of the proposed attention-based method. The experimental analysis demonstrates that the proposed attention-based method significantly outperforms other state-of-the-art baselines. Code is available at this https URL .
- [1185] arXiv:2404.07976 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Self-supervised Dataset Distillation: A Good Compression Is All You NeedSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Dataset distillation aims to compress information from a large-scale original dataset to a new compact dataset while striving to preserve the utmost degree of the original data informational essence. Previous studies have predominantly concentrated on aligning the intermediate statistics between the original and distilled data, such as weight trajectory, features, gradient, BatchNorm, etc. In this work, we consider addressing this task through the new lens of model informativeness in the compression stage on the original dataset pretraining. We observe that with the prior state-of-the-art SRe$^2$L, as model sizes increase, it becomes increasingly challenging for supervised pretrained models to recover learned information during data synthesis, as the channel-wise mean and variance inside the model are flatting and less informative. We further notice that larger variances in BN statistics from self-supervised models enable larger loss signals to update the recovered data by gradients, enjoying more informativeness during synthesis. Building on this observation, we introduce SC-DD, a simple yet effective Self-supervised Compression framework for Dataset Distillation that facilitates diverse information compression and recovery compared to traditional supervised learning schemes, further reaps the potential of large pretrained models with enhanced capabilities. Extensive experiments are conducted on CIFAR-100, Tiny-ImageNet and ImageNet-1K datasets to demonstrate the superiority of our proposed approach. The proposed SC-DD outperforms all previous state-of-the-art supervised dataset distillation methods when employing larger models, such as SRe$^2$L, MTT, TESLA, DC, CAFE, etc., by large margins under the same recovery and post-training budgets. Code is available at this https URL .
- [1186] arXiv:2404.07979 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: LLoCO: Learning Long Contexts OfflineSijun Tan , Xiuyu Li , Shishir Patil , Ziyang Wu , Tianjun Zhang , Kurt Keutzer , Joseph E. Gonzalez , Raluca Ada PopaComments: The first two authors contributed equally to this workSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Processing long contexts remains a challenge for large language models (LLMs) due to the quadratic computational and memory overhead of the self-attention mechanism and the substantial KV cache sizes during generation. We propose a novel approach to address this problem by learning contexts offline through context compression and in-domain parameter-efficient finetuning. Our method enables an LLM to create a concise representation of the original context and efficiently retrieve relevant information to answer questions accurately. We introduce LLoCO, a technique that combines context compression, retrieval, and parameter-efficient finetuning using LoRA. Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens. We evaluate our approach on several long-context question-answering datasets, demonstrating that LLoCO significantly outperforms in-context learning while using $30\times$ fewer tokens during inference. LLoCO achieves up to $7.62\times$ speed-up and substantially reduces the cost of long document question answering, making it a promising solution for efficient long context processing. Our code is publicly available at this https URL .
- [1187] arXiv:2404.07981 (cross-list from cs.IR) [ pdf , ps , other ]
-
Title: Manipulating Large Language Models to Increase Product VisibilitySubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large language models (LLMs) are increasingly being integrated into search engines to provide natural language responses tailored to user queries. Customers and end-users are also becoming more dependent on these models for quick and easy purchase decisions. In this work, we investigate whether recommendations from LLMs can be manipulated to enhance a product's visibility. We demonstrate that adding a strategic text sequence (STS) -- a carefully crafted message -- to a product's information page can significantly increase its likelihood of being listed as the LLM's top recommendation. To understand the impact of STS, we use a catalog of fictitious coffee machines and analyze its effect on two target products: one that seldom appears in the LLM's recommendations and another that usually ranks second. We observe that the strategic text sequence significantly enhances the visibility of both products by increasing their chances of appearing as the top recommendation. This ability to manipulate LLM-generated search responses provides vendors with a considerable competitive advantage and has the potential to disrupt fair market competition. Just as search engine optimization (SEO) revolutionized how webpages are customized to rank higher in search engine results, influencing LLM recommendations could profoundly impact content optimization for AI-driven search services. Code for our experiments is available at this https URL .
- [1188] arXiv:2404.07987 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ControlNet++: Improving Conditional Controls with Efficient Consistency FeedbackComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: To enhance the controllability of text-to-image diffusion models, existing efforts like ControlNet incorporated image-based conditional controls. In this paper, we reveal that existing methods still face significant challenges in generating images that align with the image conditional controls. To this end, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Specifically, for an input conditional control, we use a pre-trained discriminative reward model to extract the corresponding condition of the generated images, and then optimize the consistency loss between the input conditional control and extracted condition. A straightforward implementation would be generating images from random noises and then calculating the consistency loss, but such an approach requires storing gradients for multiple sampling timesteps, leading to considerable time and memory costs. To address this, we introduce an efficient reward strategy that deliberately disturbs the input images by adding noise, and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive costs associated with image sampling, allowing for more efficient reward fine-tuning. Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls. For example, it achieves improvements over ControlNet by 7.9% mIoU, 13.4% SSIM, and 7.6% RMSE, respectively, for segmentation mask, line-art edge, and depth conditions.
- [1189] arXiv:2404.07989 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Any2Point: Empowering Any-modality Large Models for Efficient 3D UnderstandingComments: Code and models are released at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance in widespread scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches are still limited, due to the potential loss of spatial geometries and high computation cost. More importantly, their frameworks are mainly designed for 2D models, lacking a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality. This mechanism enables us to assign each 3D token with a positional encoding paired with the pre-trained model, which avoids 3D geometry loss caused by the true projection and better motivates the transformer for 3D learning with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, compelling the semantic adaption of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method. Code and models are released at this https URL .
- [1190] arXiv:2404.07990 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: OpenBias: Open-set Bias Detection in Text-to-Image Generative ModelsMoreno D'Incà , Elia Peruzzo , Massimiliano Mancini , Dejia Xu , Vidit Goel , Xingqian Xu , Zhangyang Wang , Humphrey Shi , Nicu SebeComments: CVPR 2024 Highlight - Code: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Text-to-image generative models are becoming increasingly popular and accessible to the general public. As these models see large-scale deployments, it is necessary to deeply investigate their safety and fairness to not disseminate and perpetuate any kind of biases. However, existing works focus on detecting closed sets of biases defined a priori, limiting the studies to well-known concepts. In this paper, we tackle the challenge of open-set bias detection in text-to-image generative models presenting OpenBias, a new pipeline that identifies and quantifies the severity of biases agnostically, without access to any precompiled set. OpenBias has three stages. In the first phase, we leverage a Large Language Model (LLM) to propose biases given a set of captions. Secondly, the target generative model produces images using the same set of captions. Lastly, a Vision Question Answering model recognizes the presence and extent of the previously proposed biases. We study the behavior of Stable Diffusion 1.5, 2, and XL emphasizing new biases, never investigated before. Via quantitative experiments, we demonstrate that OpenBias agrees with current closed-set bias detection methods and human judgement.
- [1191] arXiv:2404.08001 (cross-list from hep-ph) [ pdf , ps , html , other ]
-
Title: Xiwu: A Basis Flexible and Learnable LLM for High Energy PhysicsZhengde Zhang , Yiyu Zhang , Haodong Yao , Jianwen Luo , Rui Zhao , Bo Huang , Jiameng Zhao , Yipu Liao , Ke Li , Lina Zhao , Jun Cao , Fazhi Qi , Changzheng YuanComments: 15 pages, 8 figuresSubjects: High Energy Physics - Phenomenology (hep-ph) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Computational Physics (physics.comp-ph)
Abstract: Large Language Models (LLMs) are undergoing a period of rapid updates and changes, with state-of-the-art (SOTA) model frequently being replaced. When applying LLMs to a specific scientific field, it's challenging to acquire unique domain knowledge while keeping the model itself advanced. To address this challenge, a sophisticated large language model system named as Xiwu has been developed, allowing you switch between the most advanced foundation models and quickly teach the model domain knowledge. In this work, we will report on the best practices for applying LLMs in the field of high-energy physics (HEP), including: a seed fission technology is proposed and some data collection and cleaning tools are developed to quickly obtain domain AI-Ready dataset; a just-in-time learning system is implemented based on the vector store technology; an on-the-fly fine-tuning system has been developed to facilitate rapid training under a specified foundation model. The results show that Xiwu can smoothly switch between foundation models such as LLaMA, Vicuna, ChatGLM and Grok-1. The trained Xiwu model is significantly outperformed the benchmark model on the HEP knowledge question-and-answering and code generation. This strategy significantly enhances the potential for growth of our model's performance, with the hope of surpassing GPT-4 as it evolves with the development of open-source models. This work provides a customized LLM for the field of HEP, while also offering references for applying LLM to other fields, the corresponding codes are available on Github.
- [1192] arXiv:2404.08006 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Learning Efficient and Fair Policies for Uncertainty-Aware Collaborative Human-Robot Order PickingIgor G. Smit , Zaharah Bukhsh , Mykola Pechenizkiy , Kostas Alogariastos , Kasper Hendriks , Yingqian ZhangSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract: In collaborative human-robot order picking systems, human pickers and Autonomous Mobile Robots (AMRs) travel independently through a warehouse and meet at pick locations where pickers load items onto the AMRs. In this paper, we consider an optimization problem in such systems where we allocate pickers to AMRs in a stochastic environment. We propose a novel multi-objective Deep Reinforcement Learning (DRL) approach to learn effective allocation policies to maximize pick efficiency while also aiming to improve workload fairness amongst human pickers. In our approach, we model the warehouse states using a graph, and define a neural network architecture that captures regional information and effectively extracts representations related to efficiency and workload. We develop a discrete-event simulation model, which we use to train and evaluate the proposed DRL approach. In the experiments, we demonstrate that our approach can find non-dominated policy sets that outline good trade-offs between fairness and efficiency objectives. The trained policies outperform the benchmarks in terms of both efficiency and fairness. Moreover, they show good transferability properties when tested on scenarios with different warehouse sizes. The implementation of the simulation model, proposed approach, and experiments are published.
- [1193] arXiv:2404.08013 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Enhanced Cooperative Perception for Autonomous Vehicles Using Imperfect CommunicationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Sharing and joint processing of camera feeds and sensor measurements, known as Cooperative Perception (CP), has emerged as a new technique to achieve higher perception qualities. CP can enhance the safety of Autonomous Vehicles (AVs) where their individual visual perception quality is compromised by adverse weather conditions (haze as foggy weather), low illumination, winding roads, and crowded traffic. To cover the limitations of former methods, in this paper, we propose a novel approach to realize an optimized CP under constrained communications. At the core of our approach is recruiting the best helper from the available list of front vehicles to augment the visual range and enhance the Object Detection (OD) accuracy of the ego vehicle. In this two-step process, we first select the helper vehicles that contribute the most to CP based on their visual range and lowest motion blur. Next, we implement a radio block optimization among the candidate vehicles to further improve communication efficiency. We specifically focus on pedestrian detection as an exemplary scenario. To validate our approach, we used the CARLA simulator to create a dataset of annotated videos for different driving scenarios where pedestrian detection is challenging for an AV with compromised vision. Our results demonstrate the efficacy of our two-step optimization process in improving the overall performance of cooperative perception in challenging scenarios, substantially improving driving safety under adverse conditions. Finally, we note that the networking assumptions are adopted from LTE Release 14 Mode 4 side-link communication, commonly used for Vehicle-to-Vehicle (V2V) communication. Nonetheless, our method is flexible and applicable to arbitrary V2V communications.
- [1194] arXiv:2404.08017 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: AI-Guided Feature Segmentation Techniques to Model Features from Single Crystal Diamond GrowthRohan Reddy Mekala , Elias Garratt , Matthias Muehle , Arjun Srinivasan , Adam Porter , Mikael LindvallComments: 12 pages,4 figures,ACMME 2024. arXiv admin note: substantial text overlap with arXiv:2404.07306Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Process refinement to consistently produce high-quality material over a large area of the grown crystal, enabling various applications from optics crystals to quantum detectors, has long been a goal for diamond growth. Machine learning offers a promising path toward this goal, but faces challenges such as the complexity of features within datasets, their time-dependency, and the volume of data produced per growth run. Accurate spatial feature extraction from image to image for real-time monitoring of diamond growth is crucial yet complicated due to the low-volume and high feature complexity nature of the datasets. This paper compares various traditional and machine learning-driven approaches for feature extraction in the diamond growth domain, proposing a novel deep learning-driven semantic segmentation approach to isolate and classify accurate pixel masks of geometric features like diamond, pocket holder, and background, along with their derivative features based on shape and size. Using an annotation-focused human-in-the-loop software architecture for training datasets, with modules for selective data labeling using active learning, data augmentations, and model-assisted labeling, our approach achieves effective annotation accuracy and drastically reduces labeling time and cost. Deep learning algorithms prove highly efficient in accurately learning complex representations from datasets with many features. Our top-performing model, based on the DeeplabV3plus architecture, achieves outstanding accuracy in classifying features of interest, with accuracies of 96.31% for pocket holder, 98.60% for diamond top, and 91.64% for diamond side features.
- [1195] arXiv:2404.08018 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Analyzing the Performance of Large Language Models on Code SummarizationSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large language models (LLMs) such as Llama 2 perform very well on tasks that involve both natural language and source code, particularly code summarization and code generation. We show that for the task of code summarization, the performance of these models on individual examples often depends on the amount of (subword) token overlap between the code and the corresponding reference natural language descriptions in the dataset. This token overlap arises because the reference descriptions in standard datasets (corresponding to docstrings in large code bases) are often highly similar to the names of the functions they describe. We also show that this token overlap occurs largely in the function names of the code and compare the relative performance of these models after removing function names versus removing code structure. We also show that using multiple evaluation metrics like BLEU and BERTScore gives us very little additional insight since these metrics are highly correlated with each other.
- [1196] arXiv:2404.08021 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: VeTraSS: Vehicle Trajectory Similarity Search Through Graph Modeling and Representation LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Trajectory similarity search plays an essential role in autonomous driving, as it enables vehicles to analyze the information and characteristics of different trajectories to make informed decisions and navigate safely in dynamic environments. Existing work on the trajectory similarity search task primarily utilizes sequence-processing algorithms or Recurrent Neural Networks (RNNs), which suffer from the inevitable issues of complicated architecture and heavy training costs. Considering the intricate connections between trajectories, using Graph Neural Networks (GNNs) for data modeling is feasible. However, most methods directly use existing mathematical graph structures as the input instead of constructing specific graphs from certain vehicle trajectory data. This ignores such data's unique and dynamic characteristics. To bridge such a research gap, we propose VeTraSS -- an end-to-end pipeline for Vehicle Trajectory Similarity Search. Specifically, VeTraSS models the original trajectory data into multi-scale graphs, and generates comprehensive embeddings through a novel multi-layer attention-based GNN. The learned embeddings can be used for searching similar vehicle trajectories. Extensive experiments on the Porto and Geolife datasets demonstrate the effectiveness of VeTraSS, where our model outperforms existing work and reaches the state-of-the-art. This demonstrates the potential of VeTraSS for trajectory analysis and safe navigation in self-driving vehicles in the real world.
- [1197] arXiv:2404.08027 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SurvMamba: State Space Model with Multi-grained Multi-modal Interaction for Survival PredictionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Abstract: Multi-modal learning that combines pathological images with genomic data has significantly enhanced the accuracy of survival prediction. Nevertheless, existing methods have not fully utilized the inherent hierarchical structure within both whole slide images (WSIs) and transcriptomic data, from which better intra-modal representations and inter-modal integration could be derived. Moreover, many existing studies attempt to improve multi-modal representations through attention mechanisms, which inevitably lead to high complexity when processing high-dimensional WSIs and transcriptomic data. Recently, a structured state space model named Mamba emerged as a promising approach for its superior performance in modeling long sequences with low complexity. In this study, we propose Mamba with multi-grained multi-modal interaction (SurvMamba) for survival prediction. SurvMamba is implemented with a Hierarchical Interaction Mamba (HIM) module that facilitates efficient intra-modal interactions at different granularities, thereby capturing more detailed local features as well as rich global representations. In addition, an Interaction Fusion Mamba (IFM) module is used for cascaded inter-modal interactive fusion, yielding more comprehensive features for survival prediction. Comprehensive evaluations on five TCGA datasets demonstrate that SurvMamba outperforms other existing methods in terms of performance and computational cost.
- [1198] arXiv:2404.08029 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Multi-Expert Large Language Model Architecture for Verilog Code GenerationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Abstract: Recently, there has been a surging interest in using large language models (LLMs) for Verilog code generation. However, the existing approaches are limited in terms of the quality of the generated Verilog code. To address such limitations, this paper introduces an innovative multi-expert LLM architecture for Verilog code generation (MEV-LLM). Our architecture uniquely integrates multiple LLMs, each specifically fine-tuned with a dataset that is categorized with respect to a distinct level of design complexity. It allows more targeted learning, directly addressing the nuances of generating Verilog code for each category. Empirical evidence from experiments highlights notable improvements in terms of the percentage of generated Verilog outputs that are syntactically and functionally correct. These findings underscore the efficacy of our approach, promising a forward leap in the field of automated hardware design through machine learning.
- [1199] arXiv:2404.08030 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Rethinking Artistic Copyright Infringements in the Era of Text-to-Image Generative ModelsMazda Moayeri , Samyadeep Basu , Sriram Balasubramanian , Priyatham Kattakinda , Atoosa Chengini , Robert Brauneis , Soheil FeiziSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recent text-to-image generative models such as Stable Diffusion are extremely adept at mimicking and generating copyrighted content, raising concerns amongst artists that their unique styles may be improperly copied. Understanding how generative models copy "artistic style" is more complex than duplicating a single image, as style is comprised by a set of elements (or signature) that frequently co-occurs across a body of work, where each individual work may vary significantly. In our paper, we first reformulate the problem of "artistic copyright infringement" to a classification problem over image sets, instead of probing image-wise similarities. We then introduce ArtSavant, a practical (i.e., efficient and easy to understand) tool to (i) determine the unique style of an artist by comparing it to a reference dataset of works from 372 artists curated from WikiArt, and (ii) recognize if the identified style reappears in generated images. We leverage two complementary methods to perform artistic style classification over image sets, includingTagMatch, which is a novel inherently interpretable and attributable method, making it more suitable for broader use by non-technical stake holders (artists, lawyers, judges, etc). Leveraging ArtSavant, we then perform a large-scale empirical study to provide quantitative insight on the prevalence of artistic style copying across 3 popular text-to-image generative models. Namely, amongst a dataset of prolific artists (including many famous ones), only 20% of them appear to have their styles be at a risk of copying via simple prompting of today's popular text-to-image generative models.
- [1200] arXiv:2404.08031 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Latent Guard: a Safety Framework for Text-to-image GenerationComments: under reviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: With the ability to generate high-quality images, text-to-image (T2I) models can be exploited for creating inappropriate content. To prevent misuse, existing safety measures are either based on text blacklists, which can be easily circumvented, or harmful content classification, requiring large datasets for training and offering low flexibility. Hence, we propose Latent Guard, a framework designed to improve safety measures in text-to-image generation. Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts in the input text embeddings. Our proposed framework is composed of a data generation pipeline specific to the task using large language models, ad-hoc architectural components, and a contrastive learning strategy to benefit from the generated data. The effectiveness of our method is verified on three datasets and against four baselines. Code and data will be shared at this https URL .
- [1201] arXiv:2404.08061 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Physics-Enhanced Graph Neural Networks For Soft Sensing in Industrial Internet of ThingsComments: 12 pages, 10 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: The Industrial Internet of Things (IIoT) is reshaping manufacturing, industrial processes, and infrastructure management. By fostering new levels of automation, efficiency, and predictive maintenance, IIoT is transforming traditional industries into intelligent, seamlessly interconnected ecosystems. However, achieving highly reliable IIoT can be hindered by factors such as the cost of installing large numbers of sensors, limitations in retrofitting existing systems with sensors, or harsh environmental conditions that may make sensor installation impractical. Soft (virtual) sensing leverages mathematical models to estimate variables from physical sensor data, offering a solution to these challenges. Data-driven and physics-based modeling are the two main methodologies widely used for soft sensing. The choice between these strategies depends on the complexity of the underlying system, with the data-driven approach often being preferred when the physics-based inference models are intricate and present challenges for state estimation. However, conventional deep learning models are typically hindered by their inability to explicitly represent the complex interactions among various sensors. To address this limitation, we adopt Graph Neural Networks (GNNs), renowned for their ability to effectively capture the complex relationships between sensor measurements. In this research, we propose physics-enhanced GNNs, which integrate principles of physics into graph-based methodologies. This is achieved by augmenting additional nodes in the input graph derived from the underlying characteristics of the physical processes. Our evaluation of the proposed methodology on the case study of district heating networks reveals significant improvements over purely data-driven GNNs, even in the presence of noise and parameter inaccuracies.
- [1202] arXiv:2404.08064 (cross-list from eess.AS) [ pdf , ps , other ]
-
Title: The Impact of Speech Anonymization on Pathology and Its LimitsSoroosh Tayebi Arasteh , Tomas Arias-Vergara , Paula Andrea Perez-Toro , Tobias Weise , Kai Packhaeuser , Maria Schuster , Elmar Noeth , Andreas Maier , Seung Hee YangSubjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract: Integration of speech into healthcare has intensified privacy concerns due to its potential as a non-invasive biomarker containing individual biometric information. In response, speaker anonymization aims to conceal personally identifiable information while retaining crucial linguistic content. However, the application of anonymization techniques to pathological speech, a critical area where privacy is especially vital, has not been extensively examined. This study investigates anonymization's impact on pathological speech across over 2,700 speakers from multiple German institutions, focusing on privacy, pathological utility, and demographic fairness. We explore both training-based and signal processing-based anonymization methods, and document substantial privacy improvements across disorders-evidenced by equal error rate increases up to 1933%, with minimal overall impact on utility. Specific disorders such as Dysarthria, Dysphonia, and Cleft Lip and Palate experienced minimal utility changes, while Dysglossia showed slight improvements. Our findings underscore that the impact of anonymization varies substantially across different disorders. This necessitates disorder-specific anonymization strategies to optimally balance privacy with diagnostic utility. Additionally, our fairness analysis revealed consistent anonymization effects across most of the demographics. This study demonstrates the effectiveness of anonymization in pathological speech for enhancing privacy, while also highlighting the importance of customized approaches to account for inversion attacks.
- [1203] arXiv:2404.08068 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: WildGraph: Realistic Graph-based Trajectory Generation for WildlifeSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Trajectory generation is an important task in movement studies; it circumvents the privacy, ethical, and technical challenges of collecting real trajectories from the target population. In particular, real trajectories in the wildlife domain are scarce as a result of ethical and environmental constraints of the collection process. In this paper, we consider the problem of generating long-horizon trajectories, akin to wildlife migration, based on a small set of real samples. We propose a hierarchical approach to learn the global movement characteristics of the real dataset and recursively refine localized regions. Our solution, WildGraph, discretizes the geographic path into a prototype network of H3 ( this https URL ) regions and leverages a recurrent variational auto-encoder to probabilistically generate paths over the regions, based on occupancy. WildGraph successfully generates realistic months-long trajectories using a sample size as small as 60. Experiments performed on two wildlife migration datasets demonstrate that our proposed method improves the generalization of the generated trajectories in comparison to existing work while achieving superior or comparable performance in several benchmark metrics. Our code is published on the following repository: \url{ this https URL }.
- [1204] arXiv:2404.08078 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SQBC: Active Learning using LLM-Generated Synthetic Data for Stance Detection in Online Political DiscussionsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Stance detection is an important task for many applications that analyse or support online political discussions. Common approaches include fine-tuning transformer based models. However, these models require a large amount of labelled data, which might not be available. In this work, we present two different ways to leverage LLM-generated synthetic data to train and improve stance detection agents for online political discussions: first, we show that augmenting a small fine-tuning dataset with synthetic data can improve the performance of the stance detection model. Second, we propose a new active learning method called SQBC based on the "Query-by-Comittee" approach. The key idea is to use LLM-generated synthetic data as an oracle to identify the most informative unlabelled samples, that are selected for manual labelling. Comprehensive experiments show that both ideas can improve the stance detection performance. Curiously, we observed that fine-tuning on actively selected samples can exceed the performance of using the full dataset.
- [1205] arXiv:2404.08080 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Variance-reduced Zeroth-Order Methods for Fine-Tuning Language ModelsComments: 29 pages, 25 tables, 9 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Optimization and Control (math.OC)
Abstract: Fine-tuning language models (LMs) has demonstrated success in a wide array of downstream tasks. However, as LMs are scaled up, the memory requirements for backpropagation become prohibitively high. Zeroth-order (ZO) optimization methods can leverage memory-efficient forward passes to estimate gradients. More recently, MeZO, an adaptation of ZO-SGD, has been shown to consistently outperform zero-shot and in-context learning when combined with suitable task prompts. In this work, we couple ZO methods with variance reduction techniques to enhance stability and convergence for inference-based LM fine-tuning. We introduce Memory-Efficient Zeroth-Order Stochastic Variance-Reduced Gradient (MeZO-SVRG) and demonstrate its efficacy across multiple LM fine-tuning tasks, eliminating the reliance on task-specific prompts. Evaluated across a range of both masked and autoregressive LMs on benchmark GLUE tasks, MeZO-SVRG outperforms MeZO with up to 20% increase in test accuracies in both full- and partial-parameter fine-tuning settings. MeZO-SVRG benefits from reduced computation time as it often surpasses MeZO's peak test accuracy with a $2\times$ reduction in GPU-hours. MeZO-SVRG significantly reduces the required memory footprint compared to first-order SGD, i.e. by $2\times$ for autoregressive models. Our experiments highlight that MeZO-SVRG's memory savings progressively improve compared to SGD with larger batch sizes.
- [1206] arXiv:2404.08092 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Data-Augmentation-Based Dialectal Adaptation for LLMsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This report presents GMUNLP's participation to the Dialect-Copa shared task at VarDial 2024, which focuses on evaluating the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects. The task aims to assess how well LLMs can handle non-standard dialectal varieties, as their performance on standard languages is already well-established. We propose an approach that combines the strengths of different types of language models and leverages data augmentation techniques to improve task performance on three South Slavic dialects: Chakavian, Cherkano, and Torlak. We conduct experiments using a language-family-focused encoder-based model (BERTić) and a domain-agnostic multilingual model (AYA-101). Our results demonstrate that the proposed data augmentation techniques lead to substantial performance gains across all three test datasets in the open-source model category. This work highlights the practical utility of data augmentation and the potential of LLMs in handling non-standard dialectal varieties, contributing to the broader goal of advancing natural language understanding in low-resource and dialectal settings. Code: this https URL
- [1207] arXiv:2404.08093 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Towards a Robust Soft Baby Robot With Rich Interaction Ability for Advanced Machine Learning AlgorithmsMohannad Alhakami , Dylan R. Ashley , Joel Dunham , Francesco Faccio , Eric Feron , Jürgen SchmidhuberComments: 5 pages in main text + 1 page of references, 7 figures in main text; source code available at this https URLSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Artificial intelligence has made great strides in many areas lately, yet it has had comparatively little success in general-use robotics. We believe one of the reasons for this is the disconnect between traditional robotic design and the properties needed for open-ended, creativity-based AI systems. To that end, we, taking selective inspiration from nature, build a robust, partially soft robotic limb with a large action space, rich sensory data stream from multiple cameras, and the ability to connect with others to enhance the action space and data stream. As a proof of concept, we train two contemporary machine learning algorithms to perform a simple target-finding task. Altogether, we believe that this design serves as a first step to building a robot tailor-made for achieving artificial general intelligence.
- [1208] arXiv:2404.08111 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: S3Editor: A Sparse Semantic-Disentangled Self-Training Framework for Face Video EditingGuangzhi Wang , Tianyi Chen , Kamran Ghasedi , HsiangTao Wu , Tianyu Ding , Chris Nuesmeyer , Ilya Zharkov , Mohan Kankanhalli , Luming LiangSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Face attribute editing plays a pivotal role in various applications. However, existing methods encounter challenges in achieving high-quality results while preserving identity, editing faithfulness, and temporal consistency. These challenges are rooted in issues related to the training pipeline, including limited supervision, architecture design, and optimization strategy. In this work, we introduce S3Editor, a Sparse Semantic-disentangled Self-training framework for face video editing. S3Editor is a generic solution that comprehensively addresses these challenges with three key contributions. Firstly, S3Editor adopts a self-training paradigm to enhance the training process through semi-supervision. Secondly, we propose a semantic disentangled architecture with a dynamic routing mechanism that accommodates diverse editing requirements. Thirdly, we present a structured sparse optimization schema that identifies and deactivates malicious neurons to further disentangle impacts from untarget attributes. S3Editor is model-agnostic and compatible with various editing approaches. Our extensive qualitative and quantitative results affirm that our approach significantly enhances identity preservation, editing fidelity, as well as temporal consistency.
- [1209] arXiv:2404.08113 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: Predictive Handover Strategy in 6G and Beyond: A Deep and Transfer Learning ApproachSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI)
Abstract: Next-generation cellular networks will evolve into more complex and virtualized systems, employing machine learning for enhanced optimization and leveraging higher frequency bands and denser deployments to meet varied service demands. This evolution, while bringing numerous advantages, will also pose challenges, especially in mobility management, as it will increase the overall number of handovers due to smaller coverage areas and the higher signal attenuation. To address these challenges, we propose a deep learning based algorithm for predicting the future serving cell utilizing sequential user equipment measurements to minimize the handover failures and interruption time. Our algorithm enables network operators to dynamically adjust handover triggering events or incorporate UAV base stations for enhanced coverage and capacity, optimizing network objectives like load balancing and energy efficiency through transfer learning techniques. Our framework complies with the O-RAN specifications and can be deployed in a Near-Real-Time RAN Intelligent Controller as an xApp leveraging the E2SM-KPM service model. The evaluation results demonstrate that our algorithm achieves a 92% accuracy in predicting future serving cells with high probability. Finally, by utilizing transfer learning, our algorithm significantly reduces the retraining time by 91% and 77% when new handover trigger decisions or UAV base stations are introduced to the network dynamically.
- [1210] arXiv:2404.08126 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: Auctions with LLM SummariesSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI)
Abstract: We study an auction setting in which bidders bid for placement of their content within a summary generated by a large language model (LLM), e.g., an ad auction in which the display is a summary paragraph of multiple ads. This generalizes the classic ad settings such as position auctions to an LLM generated setting, which allows us to handle general display formats. We propose a novel factorized framework in which an auction module and an LLM module work together via a prediction model to provide welfare maximizing summary outputs in an incentive compatible manner. We provide a theoretical analysis of this framework and synthetic experiments to demonstrate the feasibility and validity of the system together with welfare comparisons.
- [1211] arXiv:2404.08127 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Self-Supervised Learning of Color ConstancyComments: 7 pages, 5 figures, submitted to the IEEE International Conference on Development and Learning (ICDL 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Color constancy (CC) describes the ability of the visual system to perceive an object as having a relatively constant color despite changes in lighting conditions. While CC and its limitations have been carefully characterized in humans, it is still unclear how the visual system acquires this ability during development. Here, we present a first study showing that CC develops in a neural network trained in a self-supervised manner through an invariance learning objective. During learning, objects are presented under changing illuminations, while the network aims to map subsequent views of the same object onto close-by latent representations. This gives rise to representations that are largely invariant to the illumination conditions, offering a plausible example of how CC could emerge during human cognitive development via a form of self-supervised learning.
- [1212] arXiv:2404.08144 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: LLM Agents can Autonomously Exploit One-day VulnerabilitiesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: LLMs have becoming increasingly powerful, both in their benign and malicious uses. With the increase in capabilities, researchers have been increasingly interested in their ability to exploit cybersecurity vulnerabilities. In particular, recent work has conducted preliminary studies on the ability of LLM agents to autonomously hack websites. However, these studies are limited to simple vulnerabilities.
In this work, we show that LLM agents can autonomously exploit one-day vulnerabilities in real-world systems. To show this, we collected a dataset of 15 one-day vulnerabilities that include ones categorized as critical severity in the CVE description. When given the CVE description, GPT-4 is capable of exploiting 87% of these vulnerabilities compared to 0% for every other model we test (GPT-3.5, open-source LLMs) and open-source vulnerability scanners (ZAP and Metasploit). Fortunately, our GPT-4 agent requires the CVE description for high performance: without the description, GPT-4 can exploit only 7% of the vulnerabilities. Our findings raise questions around the widespread deployment of highly capable LLM agents. - [1213] arXiv:2404.08161 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: R2 Indicator and Deep Reinforcement Learning Enhanced Adaptive Multi-Objective Evolutionary AlgorithmSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Choosing an appropriate optimization algorithm is essential to achieving success in optimization challenges. Here we present a new evolutionary algorithm structure that utilizes a reinforcement learning-based agent aimed at addressing these issues. The agent employs a double deep q-network to choose a specific evolutionary operator based on feedback it receives from the environment during optimization. The algorithm's structure contains five single-objective evolutionary algorithm operators. This single-objective structure is transformed into a multi-objective one using the R2 indicator. This indicator serves two purposes within our structure: first, it renders the algorithm multi-objective, and second, provides a means to evaluate each algorithm's performance in each generation to facilitate constructing the reinforcement learning-based reward function. The proposed R2-reinforcement learning multi-objective evolutionary algorithm (R2-RLMOEA) is compared with six other multi-objective algorithms that are based on R2 indicators. These six algorithms include the operators used in R2-RLMOEA as well as an R2 indicator-based algorithm that randomly selects operators during optimization. We benchmark performance using the CEC09 functions, with performance measured by inverted generational distance and spacing. The R2-RLMOEA algorithm outperforms all other algorithms with strong statistical significance (p<0.001) when compared with the average spacing metric across all ten benchmarks.
- [1214] arXiv:2404.08164 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: Language Model Prompt Selection via Simulation OptimizationSubjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: With the advancement in generative language models, the selection of prompts has gained significant attention in recent years. A prompt is an instruction or description provided by the user, serving as a guide for the generative language model in content generation. Despite existing methods for prompt selection that are based on human labor, we consider facilitating this selection through simulation optimization, aiming to maximize a pre-defined score for the selected prompt. Specifically, we propose a two-stage framework. In the first stage, we determine a feasible set of prompts in sufficient numbers, where each prompt is represented by a moderate-dimensional vector. In the subsequent stage for evaluation and selection, we construct a surrogate model of the score regarding the moderate-dimensional vectors that represent the prompts. We propose sequentially selecting the prompt for evaluation based on this constructed surrogate model. We prove the consistency of the sequential evaluation procedure in our framework. We also conduct numerical experiments to demonstrate the efficacy of our proposed framework, providing practical instructions for implementation.
- [1215] arXiv:2404.08189 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Reducing hallucination in structured outputs via Retrieval-Augmented GenerationComments: To be presented at NAACL 2024. 11 pages and 4 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Abstract: A common and fundamental limitation of Generative AI (GenAI) is its propensity to hallucinate. While large language models (LLM) have taken the world by storm, without eliminating or at least reducing hallucinations, real-world GenAI systems may face challenges in user adoption. In the process of deploying an enterprise application that produces workflows based on natural language requirements, we devised a system leveraging Retrieval Augmented Generation (RAG) to greatly improve the quality of the structured output that represents such workflows. Thanks to our implementation of RAG, our proposed system significantly reduces hallucinations in the output and improves the generalization of our LLM in out-of-domain settings. In addition, we show that using a small, well-trained retriever encoder can reduce the size of the accompanying LLM, thereby making deployments of LLM-based systems less resource-intensive.
- [1216] arXiv:2404.08221 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Uncertain Boundaries: Multidisciplinary Approaches to Copyright Issues in Generative AISubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: In the rapidly evolving landscape of generative artificial intelligence (AI), the increasingly pertinent issue of copyright infringement arises as AI advances to generate content from scraped copyrighted data, prompting questions about ownership and protection that impact professionals across various careers. With this in mind, this survey provides an extensive examination of copyright infringement as it pertains to generative AI, aiming to stay abreast of the latest developments and open problems. Specifically, it will first outline methods of detecting copyright infringement in mediums such as text, image, and video. Next, it will delve an exploration of existing techniques aimed at safeguarding copyrighted works from generative models. Furthermore, this survey will discuss resources and tools for users to evaluate copyright violations. Finally, insights into ongoing regulations and proposals for AI will be explored and compared. Through combining these disciplines, the implications of AI-driven content and copyright are thoroughly illustrated and brought into question.
- [1217] arXiv:2404.08224 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: HCL-MTSAD: Hierarchical Contrastive Consistency Learning for Accurate Detection of Industrial Multivariate Time Series AnomaliesComments: This paper is a manuscript that is still in the process of revision, including Table 1, Figure 2, problem definition in section III.B and method description proposed in section IV. In addition, the submitter has not been authorized by the first author and other co-authors to post the paper to arXivSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Theory (cs.IT); Systems and Control (eess.SY)
Abstract: Multivariate Time Series (MTS) anomaly detection focuses on pinpointing samples that diverge from standard operational patterns, which is crucial for ensuring the safety and security of industrial applications. The primary challenge in this domain is to develop representations capable of discerning anomalies effectively. The prevalent methods for anomaly detection in the literature are predominantly reconstruction-based and predictive in nature. However, they typically concentrate on a single-dimensional instance level, thereby not fully harnessing the complex associations inherent in industrial MTS. To address this issue, we propose a novel self-supervised hierarchical contrastive consistency learning method for detecting anomalies in MTS, named HCL-MTSAD. It innovatively leverages data consistency at multiple levels inherent in industrial MTS, systematically capturing consistent associations across four latent levels-measurement, sample, channel, and process. By developing a multi-layer contrastive loss, HCL-MTSAD can extensively mine data consistency and spatio-temporal association, resulting in more informative representations. Subsequently, an anomaly discrimination module, grounded in self-supervised hierarchical contrastive learning, is designed to detect timestamp-level anomalies by calculating multi-scale data consistency. Extensive experiments conducted on six diverse MTS datasets retrieved from real cyber-physical systems and server machines, in comparison with 20 baselines, indicate that HCL-MTSAD's anomaly detection capability outperforms the state-of-the-art benchmark models by an average of 1.8\% in terms of F1 score.
- [1218] arXiv:2404.08233 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Generalized Population-Based Training for Hyperparameter Optimization in Reinforcement LearningComments: IEEE Transactions on Emerging Topics in Computational IntelligenceSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: Hyperparameter optimization plays a key role in the machine learning domain. Its significance is especially pronounced in reinforcement learning (RL), where agents continuously interact with and adapt to their environments, requiring dynamic adjustments in their learning trajectories. To cater to this dynamicity, the Population-Based Training (PBT) was introduced, leveraging the collective intelligence of a population of agents learning simultaneously. However, PBT tends to favor high-performing agents, potentially neglecting the explorative potential of agents on the brink of significant advancements. To mitigate the limitations of PBT, we present the Generalized Population-Based Training (GPBT), a refined framework designed for enhanced granularity and flexibility in hyperparameter adaptation. Complementing GPBT, we further introduce Pairwise Learning (PL). Instead of merely focusing on elite agents, PL employs a comprehensive pairwise strategy to identify performance differentials and provide holistic guidance to underperforming agents. By integrating the capabilities of GPBT and PL, our approach significantly improves upon traditional PBT in terms of adaptability and computational efficiency. Rigorous empirical evaluations across a range of RL benchmarks confirm that our approach consistently outperforms not only the conventional PBT but also its Bayesian-optimized variant.
- [1219] arXiv:2404.08237 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: IFViT: Interpretable Fixed-Length Representation for Fingerprint Matching via Vision TransformerComments: ready to submit to IEEE Transactions on Information Forensics and Security (TIFS)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Determining dense feature points on fingerprints used in constructing deep fixed-length representations for accurate matching, particularly at the pixel level, is of significant interest. To explore the interpretability of fingerprint matching, we propose a multi-stage interpretable fingerprint matching network, namely Interpretable Fixed-length Representation for Fingerprint Matching via Vision Transformer (IFViT), which consists of two primary modules. The first module, an interpretable dense registration module, establishes a Vision Transformer (ViT)-based Siamese Network to capture long-range dependencies and the global context in fingerprint pairs. It provides interpretable dense pixel-wise correspondences of feature points for fingerprint alignment and enhances the interpretability in the subsequent matching stage. The second module takes into account both local and global representations of the aligned fingerprint pair to achieve an interpretable fixed-length representation extraction and matching. It employs the ViTs trained in the first module with the additional fully connected layer and retrains them to simultaneously produce the discriminative fixed-length representation and interpretable dense pixel-wise correspondences of feature points. Extensive experimental results on diverse publicly available fingerprint databases demonstrate that the proposed framework not only exhibits superior performance on dense registration and matching but also significantly promotes the interpretability in deep fixed-length representations-based fingerprint matching.
- [1220] arXiv:2404.08239 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Auto-configuring Exploration-Exploitation Tradeoff in Evolutionary Computation via Deep Reinforcement LearningComments: Accepted as a full paper at GECCO 2024Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Evolutionary computation (EC) algorithms, renowned as powerful black-box optimizers, leverage a group of individuals to cooperatively search for the optimum. The exploration-exploitation tradeoff (EET) plays a crucial role in EC, which, however, has traditionally been governed by manually designed rules. In this paper, we propose a deep reinforcement learning-based framework that autonomously configures and adapts the EET throughout the EC search process. The framework allows different individuals of the population to selectively attend to the global and local exemplars based on the current search state, maximizing the cooperative search outcome. Our proposed framework is characterized by its simplicity, effectiveness, and generalizability, with the potential to enhance numerous existing EC algorithms. To validate its capabilities, we apply our framework to several representative EC algorithms and conduct extensive experiments on the augmented CEC2021 benchmark. The results demonstrate significant improvements in the performance of the backbone algorithms, as well as favorable generalization across diverse problem classes, dimensions, and population sizes. Additionally, we provide an in-depth analysis of the EET issue by interpreting the learned behaviors of EC.
- [1221] arXiv:2404.08242 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: RLEMMO: Evolutionary Multimodal Optimization Assisted By Deep Reinforcement LearningComments: Accepted as full paper at GECCO 2024Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Solving multimodal optimization problems (MMOP) requires finding all optimal solutions, which is challenging in limited function evaluations. Although existing works strike the balance of exploration and exploitation through hand-crafted adaptive strategies, they require certain expert knowledge, hence inflexible to deal with MMOP with different properties. In this paper, we propose RLEMMO, a Meta-Black-Box Optimization framework, which maintains a population of solutions and incorporates a reinforcement learning agent for flexibly adjusting individual-level searching strategies to match the up-to-date optimization status, hence boosting the search performance on MMOP. Concretely, we encode landscape properties and evolution path information into each individual and then leverage attention networks to advance population information sharing. With a novel reward mechanism that encourages both quality and diversity, RLEMMO can be effectively trained using a policy gradient algorithm. The experimental results on the CEC2013 MMOP benchmark underscore the competitive optimization performance of RLEMMO against several strong baselines.
- [1222] arXiv:2404.08262 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Pretraining and Updating Language- and Domain-specific Large Language Model: A Case Study in Japanese Business DomainSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Several previous studies have considered language- and domain-specific large language models (LLMs) as separate topics. This study explores the combination of a non-English language and a high-demand industry domain, focusing on a Japanese business-specific LLM. This type of a model requires expertise in the business domain, strong language skills, and regular updates of its knowledge. We trained a 13-billion-parameter LLM from scratch using a new dataset of business texts and patents, and continually pretrained it with the latest business documents. Further we propose a new benchmark for Japanese business domain question answering (QA) and evaluate our models on it. The results show that our pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining enhances adaptation to new information. Our pretrained model and business domain benchmark are publicly available.
- [1223] arXiv:2404.08263 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Relational Prompt-based Pre-trained Language Models for Social Event DetectionComments: ACM TOIS Under ReviewSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract: Social Event Detection (SED) aims to identify significant events from social streams, and has a wide application ranging from public opinion analysis to risk management. In recent years, Graph Neural Network (GNN) based solutions have achieved state-of-the-art performance. However, GNN-based methods often struggle with noisy and missing edges between messages, affecting the quality of learned message embedding. Moreover, these methods statically initialize node embedding before training, which, in turn, limits the ability to learn from message texts and relations simultaneously. In this paper, we approach social event detection from a new perspective based on Pre-trained Language Models (PLMs), and present RPLM_SED (Relational prompt-based Pre-trained Language Models for Social Event Detection). We first propose a new pairwise message modeling strategy to construct social messages into message pairs with multi-relational sequences. Secondly, a new multi-relational prompt-based pairwise message learning mechanism is proposed to learn more comprehensive message representation from message pairs with multi-relational prompts using PLMs. Thirdly, we design a new clustering constraint to optimize the encoding process by enhancing intra-cluster compactness and inter-cluster dispersion, making the message representation more distinguishable. We evaluate the RPLM_SED on three real-world datasets, demonstrating that the RPLM_SED model achieves state-of-the-art performance in offline, online, low-resource, and long-tail distribution scenarios for social event detection tasks.
- [1224] arXiv:2404.08285 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: A Survey of Neural Network Robustness Assessment in Image RecognitionComments: Corrected typos and grammatical errors in Section 5Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: In recent years, there has been significant attention given to the robustness assessment of neural networks. Robustness plays a critical role in ensuring reliable operation of artificial intelligence (AI) systems in complex and uncertain environments. Deep learning's robustness problem is particularly significant, highlighted by the discovery of adversarial attacks on image classification models. Researchers have dedicated efforts to evaluate robustness in diverse perturbation conditions for image recognition tasks. Robustness assessment encompasses two main techniques: robustness verification/ certification for deliberate adversarial attacks and robustness testing for random data corruptions. In this survey, we present a detailed examination of both adversarial robustness (AR) and corruption robustness (CR) in neural network assessment. Analyzing current research papers and standards, we provide an extensive overview of robustness assessment in image recognition. Three essential aspects are analyzed: concepts, metrics, and assessment methods. We investigate the perturbation metrics and range representations used to measure the degree of perturbations on images, as well as the robustness metrics specifically for the robustness conditions of classification models. The strengths and limitations of the existing methods are also discussed, and some potential directions for future research are provided.
- [1225] arXiv:2404.08309 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak AttemptsComments: 4 pages, 2 figures. This paper was submitted to The 7th Deep Learning Security and Privacy Workshop (DLSP 2024) and was accepted as extended abstract, see this https URLSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: As Large Language Models (LLMs) of Prompt Jailbreaking are getting more and more attention, it is of great significance to raise a generalized research paradigm to evaluate attack strengths and a basic model to conduct subtler experiments. In this paper, we propose a novel approach by focusing on a set of target questions that are inherently more sensitive to jailbreak prompts, aiming to circumvent the limitations posed by enhanced LLM security. Through designing and analyzing these sensitive questions, this paper reveals a more effective method of identifying vulnerabilities in LLMs, thereby contributing to the advancement of LLM security. This research not only challenges existing jailbreaking methodologies but also fortifies LLMs against potential exploits.
- [1226] arXiv:2404.08313 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: The Integration of Semantic and Structural Knowledge in Knowledge Graph Entity TypingComments: Accepted in NAACL2024 mainSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The Knowledge Graph Entity Typing (KGET) task aims to predict missing type annotations for entities in knowledge graphs. Recent works only utilize the \textit{\textbf{structural knowledge}} in the local neighborhood of entities, disregarding \textit{\textbf{semantic knowledge}} in the textual representations of entities, relations, and types that are also crucial for type inference. Additionally, we observe that the interaction between semantic and structural knowledge can be utilized to address the false-negative problem. In this paper, we propose a novel \textbf{\underline{S}}emantic and \textbf{\underline{S}}tructure-aware KG \textbf{\underline{E}}ntity \textbf{\underline{T}}yping~{(SSET)} framework, which is composed of three modules. First, the \textit{Semantic Knowledge Encoding} module encodes factual knowledge in the KG with a Masked Entity Typing task. Then, the \textit{Structural Knowledge Aggregation} module aggregates knowledge from the multi-hop neighborhood of entities to infer missing types. Finally, the \textit{Unsupervised Type Re-ranking} module utilizes the inference results from the two models above to generate type predictions that are robust to false-negative samples. Extensive experiments show that SSET significantly outperforms existing state-of-the-art methods.
- [1227] arXiv:2404.08322 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: BOND: Bootstrapping From-Scratch Name Disambiguation with Multi-task PromotingComments: TheWebConf 2024 (WWW '24)Journal-ref: Proceedings of TheWebConf 2024 (WWW '24), May 13--17, 2024, SingaporeSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI)
Abstract: From-scratch name disambiguation is an essential task for establishing a reliable foundation for academic platforms. It involves partitioning documents authored by identically named individuals into groups representing distinct real-life experts. Canonically, the process is divided into two decoupled tasks: locally estimating the pairwise similarities between documents followed by globally grouping these documents into appropriate clusters. However, such a decoupled approach often inhibits optimal information exchange between these intertwined tasks. Therefore, we present BOND, which bootstraps the local and global informative signals to promote each other in an end-to-end regime. Specifically, BOND harnesses local pairwise similarities to drive global clustering, subsequently generating pseudo-clustering labels. These global signals further refine local pairwise characterizations. The experimental results establish BOND's superiority, outperforming other advanced baselines by a substantial margin. Moreover, an enhanced version, BOND+, incorporating ensemble and post-match techniques, rivals the top methods in the WhoIsWho competition.
- [1228] arXiv:2404.08359 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Improving Health Question Answering with Reliable and Time-Aware Evidence RetrievalComments: Accepted to NAACL 2024 (Findings)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: In today's digital world, seeking answers to health questions on the Internet is a common practice. However, existing question answering (QA) systems often rely on using pre-selected and annotated evidence documents, thus making them inadequate for addressing novel questions. Our study focuses on the open-domain QA setting, where the key challenge is to first uncover relevant evidence in large knowledge bases. By utilizing the common retrieve-then-read QA pipeline and PubMed as a trustworthy collection of medical research documents, we answer health questions from three diverse datasets. We modify different retrieval settings to observe their influence on the QA pipeline's performance, including the number of retrieved documents, sentence selection process, the publication year of articles, and their number of citations. Our results reveal that cutting down on the amount of retrieved documents and favoring more recent and highly cited documents can improve the final macro F1 score up to 10%. We discuss the results, highlight interesting examples, and outline challenges for future research, like managing evidence disagreement and crafting user-friendly explanations.
- [1229] arXiv:2404.08361 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Large-Scale Multi-Domain Recommendation: an Automatic Domain Feature Extraction and Personalized Integration FrameworkComments: 8 pagesSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Feed recommendation is currently the mainstream mode for many real-world applications (e.g., TikTok, Dianping), it is usually necessary to model and predict user interests in multiple scenarios (domains) within and even outside the application. Multi-domain learning is a typical solution in this regard. While considerable efforts have been made in this regard, there are still two long-standing challenges: (1) Accurately depicting the differences among domains using domain features is crucial for enhancing the performance of each domain. However, manually designing domain features and models for numerous domains can be a laborious task. (2) Users typically have limited impressions in only a few domains. Extracting features automatically from other domains and leveraging them to improve the predictive capabilities of each domain has consistently posed a challenging problem. In this paper, we propose an Automatic Domain Feature Extraction and Personalized Integration (DFEI) framework for the large-scale multi-domain recommendation. The framework automatically transforms the behavior of each individual user into an aggregation of all user behaviors within the domain, which serves as the domain features. Unlike offline feature engineering methods, the extracted domain features are higher-order representations and directly related to the target label. Besides, by personalized integration of domain features from other domains for each user and the innovation in the training mode, the DFEI framework can yield more accurate conversion identification. Experimental results on both public and industrial datasets, consisting of over 20 domains, clearly demonstrate that the proposed framework achieves significantly better performance compared with SOTA baselines. Furthermore, we have released the source code of the proposed framework at this https URL .
- [1230] arXiv:2404.08376 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Graph data augmentation with Gromow-Wasserstein BarycentersComments: 6 pages, 3 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graphs are ubiquitous in various fields, and deep learning methods have been successful applied in graph classification tasks. However, building large and diverse graph datasets for training can be expensive. While augmentation techniques exist for structured data like images or numerical data, the augmentation of graph data remains challenging. This is primarily due to the complex and non-Euclidean nature of graph data. In this paper, it has been proposed a novel augmentation strategy for graphs that operates in a non-Euclidean space. This approach leverages graphon estimation, which models the generative mechanism of networks sequences. Computational results demonstrate the effectiveness of the proposed augmentation framework in improving the performance of graph classification models. Additionally, using a non-Euclidean distance, specifically the Gromow-Wasserstein distance, results in better approximations of the graphon. This framework also provides a means to validate different graphon estimation approaches, particularly in real-world scenarios where the true graphon is unknown.
- [1231] arXiv:2404.08382 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You ThinkSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Multiple choice questions (MCQs) are commonly used to evaluate the capabilities of large language models (LLMs). One common way to evaluate the model response is to rank the candidate answers based on the log probability of the first token prediction. An alternative way is to examine the text output. Prior work has shown that first token probabilities lack robustness to changes in MCQ phrasing, and that first token probabilities do not match text answers for instruction-tuned models. Therefore, in this paper, we investigate the robustness of text answers. We show that the text answers are more robust to question perturbations than the first token probabilities, when the first token answers mismatch the text answers. The difference in robustness increases as the mismatch rate becomes greater. As the mismatch reaches over 50\%, the text answer is more robust to option order changes than the debiased first token probabilities using state-of-the-art debiasing methods such as PriDe. Our findings provide further evidence for the benefits of text answer evaluation over first token probability evaluation.
- [1232] arXiv:2404.08398 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: Multi-Agent eXperimenter (MAX)Comments: 3 pages, no figuresSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: We present a novel multi-agent simulator named Multi-Agent eXperimenter (MAX) that is designed to simulate blockchain experiments involving large numbers of agents of different types acting in one or several environments. The architecture of MAX is highly modular, enabling easy addition of new models.
- [1233] arXiv:2404.08401 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: No Bells, Just Whistles: Sports Field Registration by Leveraging Geometric PropertiesComments: Accepted in CVPRW 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Broadcast sports field registration is traditionally addressed as a homography estimation task, mapping the visible image area to a planar field model, predominantly focusing on the main camera shot. Addressing the shortcomings of previous approaches, we propose a novel calibration pipeline enabling camera calibration using a 3D soccer field model and extending the process to assess the multiple-view nature of broadcast videos. Our approach begins with a keypoint generation pipeline derived from SoccerNet dataset annotations, leveraging the geometric properties of the court. Subsequently, we execute classical camera calibration through DLT algorithm in a minimalist fashion, without further refinement. Through extensive experimentation on real-world soccer broadcast datasets such as SoccerNet-Calibration, WorldCup 2014 and TS- WorldCup, our method demonstrates superior performance in both multiple- and single-view 3D camera calibration while maintaining competitive results in homography estimation compared to state-of-the-art techniques.
- [1234] arXiv:2404.08408 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Seismic First Break Picking in a Higher Dimension Using Deep Graph LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Geophysics (physics.geo-ph)
Abstract: Contemporary automatic first break (FB) picking methods typically analyze 1D signals, 2D source gathers, or 3D source-receiver gathers. Utilizing higher-dimensional data, such as 2D or 3D, incorporates global features, improving the stability of local picking. Despite the benefits, high-dimensional data requires structured input and increases computational demands. Addressing this, we propose a novel approach using deep graph learning called DGL-FB, constructing a large graph to efficiently extract information. In this graph, each seismic trace is represented as a node, connected by edges that reflect similarities. To manage the size of the graph, we develop a subgraph sampling technique to streamline model training and inference. Our proposed framework, DGL-FB, leverages deep graph learning for FB picking. It encodes subgraphs into global features using a deep graph encoder. Subsequently, the encoded global features are combined with local node signals and fed into a ResUNet-based 1D segmentation network for FB detection. Field survey evaluations of DGL-FB show superior accuracy and stability compared to a 2D U-Net-based benchmark method.
- [1235] arXiv:2404.08412 (cross-list from physics.flu-dyn) [ pdf , ps , html , other ]
-
Title: PiRD: Physics-informed Residual Diffusion for Flow Field ReconstructionComments: 22 pagesSubjects: Fluid Dynamics (physics.flu-dyn) ; Artificial Intelligence (cs.AI)
Abstract: The use of machine learning in fluid dynamics is becoming more common to expedite the computation when solving forward and inverse problems of partial differential equations. Yet, a notable challenge with existing convolutional neural network (CNN)-based methods for data fidelity enhancement is their reliance on specific low-fidelity data patterns and distributions during the training phase. In addition, the CNN-based method essentially treats the flow reconstruction task as a computer vision task that prioritizes the element-wise precision which lacks a physical and mathematical explanation. This dependence can dramatically affect the models' effectiveness in real-world scenarios, especially when the low-fidelity input deviates from the training data or contains noise not accounted for during training. The introduction of diffusion models in this context shows promise for improving performance and generalizability. Unlike direct mapping from a specific low-fidelity to a high-fidelity distribution, diffusion models learn to transition from any low-fidelity distribution towards a high-fidelity one. Our proposed model - Physics-informed Residual Diffusion, demonstrates the capability to elevate the quality of data from both standard low-fidelity inputs, to low-fidelity inputs with injected Gaussian noise, and randomly collected samples. By integrating physics-based insights into the objective function, it further refines the accuracy and the fidelity of the inferred high-quality data. Experimental results have shown that our approach can effectively reconstruct high-quality outcomes for two-dimensional turbulent flows from a range of low-fidelity input conditions without requiring retraining.
- [1236] arXiv:2404.08414 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Evolutionary Preference Sampling for Pareto Set LearningComments: Genetic and Evolutionary Computation Conference (GECCO '24)Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Recently, Pareto Set Learning (PSL) has been proposed for learning the entire Pareto set using a neural network. PSL employs preference vectors to scalarize multiple objectives, facilitating the learning of mappings from preference vectors to specific Pareto optimal solutions. Previous PSL methods have shown their effectiveness in solving artificial multi-objective optimization problems (MOPs) with uniform preference vector sampling. The quality of the learned Pareto set is influenced by the sampling strategy of the preference vector, and the sampling of the preference vector needs to be decided based on the Pareto front shape. However, a fixed preference sampling strategy cannot simultaneously adapt the Pareto front of multiple MOPs. To address this limitation, this paper proposes an Evolutionary Preference Sampling (EPS) strategy to efficiently sample preference vectors. Inspired by evolutionary algorithms, we consider preference sampling as an evolutionary process to generate preference vectors for neural network training. We integrate the EPS strategy into five advanced PSL methods. Extensive experiments demonstrate that our proposed method has a faster convergence speed than baseline algorithms on 7 testing problems. Our implementation is available at this https URL .
- [1237] arXiv:2404.08417 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: AdapterSwap: Continuous Training of LLMs with Data Removal and Access-Control GuaranteesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large language models (LLMs) are increasingly capable of completing knowledge intensive tasks by recalling information from a static pretraining corpus. Here we are concerned with LLMs in the context of evolving data requirements. For instance: batches of new data that are introduced periodically; subsets of data with user-based access controls; or requirements on dynamic removal of documents with guarantees that associated knowledge cannot be recalled. We wish to satisfy these requirements while at the same time ensuring a model does not forget old information when new data becomes available. To address these issues, we introduce AdapterSwap, a training and inference scheme that organizes knowledge from a data collection into a set of low-rank adapters, which are dynamically composed during inference. Our experiments demonstrate AdapterSwap's ability to support efficient continual learning, while also enabling organizations to have fine-grained control over data access and deletion.
- [1238] arXiv:2404.08424 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization TaskSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Intention-based Human-Robot Interaction (HRI) systems allow robots to perceive and interpret user actions to proactively interact with humans and adapt to their behavior. Therefore, intention prediction is pivotal in creating a natural interactive collaboration between humans and robots. In this paper, we examine the use of Large Language Models (LLMs) for inferring human intention during a collaborative object categorization task with a physical robot. We introduce a hierarchical approach for interpreting user non-verbal cues, like hand gestures, body poses, and facial expressions and combining them with environment states and user verbal cues captured using an existing Automatic Speech Recognition (ASR) system. Our evaluation demonstrates the potential of LLMs to interpret non-verbal cues and to combine them with their context-understanding capabilities and real-world knowledge to support intention prediction during human-robot interaction.
- [1239] arXiv:2404.08434 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: An improved tabular data generator with VAE-GMM integrationComments: 7 pages, 3 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The rising use of machine learning in various fields requires robust methods to create synthetic tabular data. Data should preserve key characteristics while addressing data scarcity challenges. Current approaches based on Generative Adversarial Networks, such as the state-of-the-art CTGAN model, struggle with the complex structures inherent in tabular data. These data often contain both continuous and discrete features with non-Gaussian distributions. Therefore, we propose a novel Variational Autoencoder (VAE)-based model that addresses these limitations. Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture. This avoids the limitations imposed by assuming a strictly Gaussian latent space, allowing for a more accurate representation of the underlying data distribution during data generation. Furthermore, our model offers enhanced flexibility by allowing the use of various differentiable distributions for individual features, making it possible to handle both continuous and discrete data types. We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones, based on their resemblance and utility. This evaluation demonstrates significant outperformance against CTGAN and TVAE, establishing its potential as a valuable tool for generating synthetic tabular data in various domains, particularly in healthcare.
- [1240] arXiv:2404.08458 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: On the Independence Assumption in Neurosymbolic LearningComments: 11 pages, 8 appendix pages, 9 figuresSubjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: State-of-the-art neurosymbolic learning systems use probabilistic reasoning to guide neural networks towards predictions that conform to logical constraints over symbols. Many such systems assume that the probabilities of the considered symbols are conditionally independent given the input to simplify learning and reasoning. We study and criticise this assumption, highlighting how it can hinder optimisation and prevent uncertainty quantification. We prove that loss functions bias conditionally independent neural networks to become overconfident in their predictions. As a result, they are unable to represent uncertainty over multiple valid options. Furthermore, we prove that these loss functions are difficult to optimise: they are non-convex, and their minima are usually highly disconnected. Our theoretical analysis gives the foundation for replacing the conditional independence assumption and designing more expressive neurosymbolic probabilistic models.
- [1241] arXiv:2404.08461 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: OTTER: Improving Zero-Shot Classification via Optimal TransportComments: 29 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Popular zero-shot models suffer due to artifacts inherited from pretraining. A particularly detrimental artifact, caused by unbalanced web-scale pretraining data, is mismatched label distribution. Existing approaches that seek to repair the label distribution are not suitable in zero-shot settings, as they have incompatible requirements such as access to labeled downstream task data or knowledge of the true label balance in the pretraining distribution. We sidestep these challenges and introduce a simple and lightweight approach to adjust pretrained model predictions via optimal transport. Our technique requires only an estimate of the label distribution of a downstream task. Theoretically, we characterize the improvement produced by our procedure under certain mild conditions and provide bounds on the error caused by misspecification. Empirically, we validate our method in a wide array of zero-shot image and text classification tasks, improving accuracy by 4.8% and 15.9% on average, and beating baselines like Prior Matching -- often by significant margins -- in 17 out of 21 datasets.
- [1242] arXiv:2404.08471 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Revisiting Feature Prediction for Learning Visual Representations from VideoAdrien Bardes , Quentin Garrido , Jean Ponce , Xinlei Chen , Michael Rabbat , Yann LeCun , Mahmoud Assran , Nicolas BallasSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaption of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
- [1243] arXiv:2404.08476 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: Combining Statistical Depth and Fermat Distance for Uncertainty QuantificationComments: 12 pagesSubjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR); Applications (stat.AP)
Abstract: We measure the Out-of-domain uncertainty in the prediction of Neural Networks using a statistical notion called ``Lens Depth'' (LD) combined with Fermat Distance, which is able to capture precisely the ``depth'' of a point with respect to a distribution in feature space, without any assumption about the form of distribution. Our method has no trainable parameter. The method is applicable to any classification model as it is applied directly in feature space at test time and does not intervene in training process. As such, it does not impact the performance of the original model. The proposed method gives excellent qualitative result on toy datasets and can give competitive or better uncertainty estimation on standard deep learning datasets compared to strong baseline methods.
- [1244] arXiv:2404.08491 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Mitigating Language-Level Performance Disparity in mPLMs via Teacher Language Selection and Cross-lingual Self-DistillationComments: NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large-scale multilingual Pretrained Language Models (mPLMs) yield impressive performance on cross-language tasks, yet significant performance disparities exist across different languages within the same mPLM. Previous studies endeavored to narrow these disparities by supervise fine-tuning the mPLMs with multilingual data. However, obtaining labeled multilingual data is time-consuming, and fine-tuning mPLM with limited labeled multilingual data merely encapsulates the knowledge specific to the labeled data. Therefore, we introduce ALSACE to leverage the learned knowledge from the well-performing languages to guide under-performing ones within the same mPLM, eliminating the need for additional labeled multilingual data. Experiments show that ALSACE effectively mitigates language-level performance disparity across various mPLMs while showing the competitive performance on different multilingual NLU tasks, ranging from full resource to limited resource settings. The code for our approach is available at this https URL .
- [1245] arXiv:2404.08495 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Dataset Reset Policy Optimization for RLHFJonathan D. Chang , Wenhao Zhan , Owen Oertell , Kianté Brantley , Dipendra Misra , Jason D. Lee , Wen SunComments: 28 pages, 6 tables, 3 Figures, 3 AlgorithmsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Reinforcement Learning (RL) from Human Preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as good as any policy that is covered by the offline dataset under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful Harmful (HH) dataset, the generation from DR-PO is better than that from Proximal Policy Optimization (PPO) and Direction Preference Optimization (DPO), under the metric of GPT4 win-rate. Code for this work can be found at this https URL .
- [1246] arXiv:2404.08501 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Analyzing and Overcoming Local Optima in Complex Multi-Objective Optimization by Decomposition-Based Evolutionary AlgorithmsSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: When addressing the challenge of complex multi-objective optimization problems, particularly those with non-convex and non-uniform Pareto fronts, Decomposition-based Multi-Objective Evolutionary Algorithms (MOEADs) often converge to local optima, thereby limiting solution diversity. Despite its significance, this issue has received limited theoretical exploration. Through a comprehensive geometric analysis, we identify that the traditional method of Reference Point (RP) selection fundamentally contributes to this challenge. In response, we introduce an innovative RP selection strategy, the Weight Vector-Guided and Gaussian-Hybrid method, designed to overcome the local optima issue. This approach employs a novel RP type that aligns with weight vector directions and integrates a Gaussian distribution to combine three distinct RP categories. Our research comprises two main experimental components: an ablation study involving 14 algorithms within the MOEADs framework, spanning from 2014 to 2022, to validate our theoretical framework, and a series of empirical tests to evaluate the effectiveness of our proposed method against both traditional and cutting-edge alternatives. Results demonstrate that our method achieves remarkable improvements in both population diversity and convergence.
- [1247] arXiv:2404.08513 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Adversarial Imitation Learning via BoostingComments: 19 pages, 7 figures, 4 tables, 3 algorithms, ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Adversarial imitation learning (AIL) has stood out as a dominant framework across various imitation learning (IL) applications, with Discriminator Actor Critic (DAC) (Kostrikov et al.,, 2019) demonstrating the effectiveness of off-policy learning algorithms in improving sample efficiency and scalability to higher-dimensional observations. Despite DAC's empirical success, the original AIL objective is on-policy and DAC's ad-hoc application of off-policy training does not guarantee successful imitation (Kostrikov et al., 2019; 2020). Follow-up work such as ValueDICE (Kostrikov et al., 2020) tackles this issue by deriving a fully off-policy AIL objective. Instead in this work, we develop a novel and principled AIL algorithm via the framework of boosting. Like boosting, our new algorithm, AILBoost, maintains an ensemble of properly weighted weak learners (i.e., policies) and trains a discriminator that witnesses the maximum discrepancy between the distributions of the ensemble and the expert policy. We maintain a weighted replay buffer to represent the state-action distribution induced by the ensemble, allowing us to train discriminators using the entire data collected so far. In the weighted replay buffer, the contribution of the data from older policies are properly discounted with the weight computed based on the boosting framework. Empirically, we evaluate our algorithm on both controller state-based and pixel-based environments from the DeepMind Control Suite. AILBoost outperforms DAC on both types of environments, demonstrating the benefit of properly weighting replay buffer data for off-policy training. On state-based environments, DAC outperforms ValueDICE and IQ-Learn (Gary et al., 2021), achieving competitive performance with as little as one expert trajectory.
- [1248] arXiv:2404.08517 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path ForwardSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract: While Large Language Models (LLMs) have seen widespread applications across numerous fields, their limited interpretability poses concerns regarding their safe operations from multiple aspects, e.g., truthfulness, robustness, and fairness. Recent research has started developing quality assurance methods for LLMs, introducing techniques such as offline detector-based or uncertainty estimation methods. However, these approaches predominantly concentrate on post-generation analysis, leaving the online safety analysis for LLMs during the generation phase an unexplored area. To bridge this gap, we conduct in this work a comprehensive evaluation of the effectiveness of existing online safety analysis methods on LLMs. We begin with a pilot study that validates the feasibility of detecting unsafe outputs in the early generation process. Following this, we establish the first publicly available benchmark of online safety analysis for LLMs, including a broad spectrum of methods, models, tasks, datasets, and evaluation metrics. Utilizing this benchmark, we extensively analyze the performance of state-of-the-art online safety analysis methods on both open-source and closed-source LLMs. This analysis reveals the strengths and weaknesses of individual methods and offers valuable insights into selecting the most appropriate method based on specific application scenarios and task requirements. Furthermore, we also explore the potential of using hybridization methods, i.e., combining multiple methods to derive a collective safety conclusion, to enhance the efficacy of online safety analysis for LLMs. Our findings indicate a promising direction for the development of innovative and trustworthy quality assurance methodologies for LLMs, facilitating their reliable deployments across diverse domains.
- [1249] arXiv:2404.08523 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Advancing Forest Fire Prevention: Deep Reinforcement Learning for Effective Firebreak PlacementLucas Murray , Tatiana Castillo , Jaime Carrasco , Andrés Weintraub , Richard Weber , Isaac Martín de Diego , José Ramón González , Jordi García-GonzaloComments: 20 pages, 15 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Over the past decades, the increase in both frequency and intensity of large-scale wildfires due to climate change has emerged as a significant natural threat. The pressing need to design resilient landscapes capable of withstanding such disasters has become paramount, requiring the development of advanced decision-support tools. Existing methodologies, including Mixed Integer Programming, Stochastic Optimization, and Network Theory, have proven effective but are hindered by computational demands, limiting their applicability.
In response to this challenge, we propose using artificial intelligence techniques, specifically Deep Reinforcement Learning, to address the complex problem of firebreak placement in the landscape. We employ value-function based approaches like Deep Q-Learning, Double Deep Q-Learning, and Dueling Double Deep Q-Learning. Utilizing the Cell2Fire fire spread simulator combined with Convolutional Neural Networks, we have successfully implemented a computational agent capable of learning firebreak locations within a forest environment, achieving good results.
Furthermore, we incorporate a pre-training loop, initially teaching our agent to mimic a heuristic-based algorithm and observe that it consistently exceeds the performance of these solutions. Our findings underscore the immense potential of Deep Reinforcement Learning for operational research challenges, especially in fire prevention. Our approach demonstrates convergence with highly favorable results in problem instances as large as 40 x 40 cells, marking a significant milestone in applying Reinforcement Learning to this critical issue.
To the best of our knowledge, this study represents a pioneering effort in using Reinforcement Learning to address the aforementioned problem, offering promising perspectives in fire prevention and landscape management - [1250] arXiv:2404.08544 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Analyzing Decades-Long Environmental Changes in Namibia Using Archival Aerial Photography and Deep LearningGirmaw Abebe Tadesse , Caleb Robinson , Gilles Quentin Hacheme , Akram Zaytar , Rahul Dodhia , Tsering Wangyal Shawa , Juan M. Lavista Ferres , Emmanuel H. KreikeSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This study explores object detection in historical aerial photographs of Namibia to identify long-term environmental changes. Specifically, we aim to identify key objects -- Waterholes, Omuti homesteads, and Big trees -- around Oshikango in Namibia using sub-meter gray-scale aerial imagery from 1943 and 1972. In this work, we propose a workflow for analyzing historical aerial imagery using a deep semantic segmentation model on sparse hand-labels. To this end, we employ a number of strategies including class-weighting, pseudo-labeling and empirical p-value-based filtering to balance skewed and sparse representations of objects in the ground truth data. Results demonstrate the benefits of these different training strategies resulting in an average $F_1=0.661$ and $F_1=0.755$ over the three objects of interest for the 1943 and 1972 imagery, respectively. We also identified that the average size of Waterhole and Big trees increased while the average size of Omuti homesteads decreased between 1943 and 1972 reflecting some of the local effects of the massive post-Second World War economic, agricultural, demographic, and environmental changes. This work also highlights the untapped potential of historical aerial photographs in understanding long-term environmental changes beyond Namibia (and Africa). With the lack of adequate satellite technology in the past, archival aerial photography offers a great alternative to uncover decades-long environmental changes.
- [1251] arXiv:2404.08555 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMsShreyas Chaudhari , Pranjal Aggarwal , Vishvak Murahari , Tanmay Rajpurohit , Ashwin Kalyan , Karthik Narasimhan , Ameet Deshpande , Bruno Castro da SilvaSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences and mitigate issues like toxicity and hallucinations. Yet, an understanding of RLHF for LLMs is largely entangled with initial design choices that popularized the method and current research focuses on augmenting those choices rather than fundamentally improving the framework. In this paper, we analyze RLHF through the lens of reinforcement learning principles to develop an understanding of its fundamentals, dedicating substantial focus to the core component of RLHF -- the reward model. Our study investigates modeling choices, caveats of function approximation, and their implications on RLHF training algorithms, highlighting the underlying assumptions made about the expressivity of reward. Our analysis improves the understanding of the role of reward models and methods for their training, concurrently revealing limitations of the current methodology. We characterize these limitations, including incorrect generalization, model misspecification, and the sparsity of feedback, along with their impact on the performance of a language model. The discussion and analysis are substantiated by a categorical review of current literature, serving as a reference for researchers and practitioners to understand the challenges of RLHF and build upon existing efforts.
- [1252] arXiv:2404.08561 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: IDD-X: A Multi-View Dataset for Ego-relative Important Object Localization and Explanation in Dense and Unstructured TrafficComments: Accepted at ICRA 2024; Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Intelligent vehicle systems require a deep understanding of the interplay between road conditions, surrounding entities, and the ego vehicle's driving behavior for safe and efficient navigation. This is particularly critical in developing countries where traffic situations are often dense and unstructured with heterogeneous road occupants. Existing datasets, predominantly geared towards structured and sparse traffic scenarios, fall short of capturing the complexity of driving in such environments. To fill this gap, we present IDD-X, a large-scale dual-view driving video dataset. With 697K bounding boxes, 9K important object tracks, and 1-12 objects per video, IDD-X offers comprehensive ego-relative annotations for multiple important road objects covering 10 categories and 19 explanation label categories. The dataset also incorporates rearview information to provide a more complete representation of the driving environment. We also introduce custom-designed deep networks aimed at multiple important object localization and per-object explanation prediction. Overall, our dataset and introduced prediction models form the foundation for studying how road conditions and surrounding entities affect driving behavior in complex traffic situations.
- [1253] arXiv:2404.08562 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Dynamic Neural Control Flow Execution: An Agent-Based Deep Equilibrium Approach for Binary Vulnerability DetectionSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Software vulnerabilities are a challenge in cybersecurity. Manual security patches are often difficult and slow to be deployed, while new vulnerabilities are created. Binary code vulnerability detection is less studied and more complex compared to source code, and this has important practical implications. Deep learning has become an efficient and powerful tool in the security domain, where it provides end-to-end and accurate prediction. Modern deep learning approaches learn the program semantics through sequence and graph neural networks, using various intermediate representation of programs, such as abstract syntax trees (AST) or control flow graphs (CFG). Due to the complex nature of program execution, the output of an execution depends on the many program states and inputs. Also, a CFG generated from static analysis can be an overestimation of the true program flow. Moreover, the size of programs often does not allow a graph neural network with fixed layers to aggregate global information. To address these issues, we propose DeepEXE, an agent-based implicit neural network that mimics the execution path of a program. We use reinforcement learning to enhance the branching decision at every program state transition and create a dynamic environment to learn the dependency between a vulnerability and certain program states. An implicitly defined neural network enables nearly infinite state transitions until convergence, which captures the structural information at a higher level. The experiments are conducted on two semi-synthetic and two real-world datasets. We show that DeepEXE is an accurate and efficient method and outperforms the state-of-the-art vulnerability detection methods.
- [1254] arXiv:2404.08567 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model InferenceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In response to the rising interest in large multimodal models, we introduce Cross-Attention Token Pruning (CATP), a precision-focused token pruning method. Our approach leverages cross-attention layers in multimodal models, exemplified by BLIP-2, to extract valuable information for token importance determination. CATP employs a refined voting strategy across model heads and layers. In evaluations, CATP achieves up to 12.1X higher accuracy compared to existing token pruning methods, addressing the trade-off between computational efficiency and model precision.
- [1255] arXiv:2404.08570 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Enhancing Autonomous Vehicle Training with Language Model Integration and Critical Scenario GenerationHanlin Tian , Kethan Reddy , Yuxiang Feng , Mohammed Quddus , Yiannis Demiris , Panagiotis AngeloudisComments: 7 pages, 5 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper introduces CRITICAL, a novel closed-loop framework for autonomous vehicle (AV) training and testing. CRITICAL stands out for its ability to generate diverse scenarios, focusing on critical driving situations that target specific learning and performance gaps identified in the Reinforcement Learning (RL) agent. The framework achieves this by integrating real-world traffic dynamics, driving behavior analysis, surrogate safety measures, and an optional Large Language Model (LLM) component. It is proven that the establishment of a closed feedback loop between the data generation pipeline and the training process can enhance the learning rate during training, elevate overall system performance, and augment safety resilience. Our evaluations, conducted using the Proximal Policy Optimization (PPO) and the HighwayEnv simulation environment, demonstrate noticeable performance improvements with the integration of critical case generation and LLM analysis, indicating CRITICAL's potential to improve the robustness of AV systems and streamline the generation of critical scenarios. This ultimately serves to hasten the development of AV agents, expand the general scope of RL training, and ameliorate validation efforts for AV safety.
- [1256] arXiv:2404.08579 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Small Models Are (Still) Effective Cross-Domain Argument ExtractorsComments: ACL Rolling Review Short PaperSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Effective ontology transfer has been a major goal of recent work on event argument extraction (EAE). Two methods in particular -- question answering (QA) and template infilling (TI) -- have emerged as promising approaches to this problem. However, detailed explorations of these techniques' ability to actually enable this transfer are lacking. In this work, we provide such a study, exploring zero-shot transfer using both techniques on six major EAE datasets at both the sentence and document levels. Further, we challenge the growing reliance on LLMs for zero-shot extraction, showing that vastly smaller models trained on an appropriate source ontology can yield zero-shot performance superior to that of GPT-3.5 or GPT-4.
- [1257] arXiv:2404.08582 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: FashionFail: Addressing Failure Cases in Fashion Object Detection and SegmentationComments: to be published in 2024 International Joint Conference on Neural Networks (IJCNN)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In the realm of fashion object detection and segmentation for online shopping images, existing state-of-the-art fashion parsing models encounter limitations, particularly when exposed to non-model-worn apparel and close-up shots. To address these failures, we introduce FashionFail; a new fashion dataset with e-commerce images for object detection and segmentation. The dataset is efficiently curated using our novel annotation tool that leverages recent foundation models. The primary objective of FashionFail is to serve as a test bed for evaluating the robustness of models. Our analysis reveals the shortcomings of leading models, such as Attribute-Mask R-CNN and Fashionformer. Additionally, we propose a baseline approach using naive data augmentation to mitigate common failure cases and improve model robustness. Through this work, we aim to inspire and support further research in fashion item detection and segmentation for industrial applications. The dataset, annotation tool, code, and models are available at \url{ this https URL }.
- [1258] arXiv:2404.08589 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Enhancing Visual Question Answering through Question-Driven Image Captions as PromptsComments: The paper has been accepted for presentation at CVPR 2024 Workshop on Prompting in VisionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Visual question answering (VQA) is known as an AI-complete task as it requires understanding, reasoning, and inferring about the vision and the language content. Over the past few years, numerous neural architectures have been suggested for the VQA problem. However, achieving success in zero-shot VQA remains a challenge due to its requirement for advanced generalization and reasoning skills. This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline. Specifically, we explore the efficacy of utilizing image captions instead of images and leveraging large language models (LLMs) to establish a zero-shot setting. Since image captioning is the most crucial step in this process, we compare the impact of state-of-the-art image captioning models on VQA performance across various question types in terms of structure and semantics. We propose a straightforward and efficient question-driven image captioning approach within this pipeline to transfer contextual information into the question-answering (QA) model. This method involves extracting keywords from the question, generating a caption for each image-question pair using the keywords, and incorporating the question-driven caption into the LLM prompt. We evaluate the efficacy of using general-purpose and question-driven image captions in the VQA pipeline. Our study highlights the potential of employing image captions and harnessing the capabilities of LLMs to achieve competitive performance on GQA under the zero-shot setting. Our code is available at \url{ this https URL }.
- [1259] arXiv:2404.08590 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Improving Referring Image Segmentation using Vision-Aware Text FeaturesComments: 30 pages including supplementarySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Referring image segmentation is a challenging task that involves generating pixel-wise segmentation masks based on natural language descriptions. Existing methods have relied mostly on visual features to generate the segmentation masks while treating text features as supporting components. This over-reliance on visual features can lead to suboptimal results, especially in complex scenarios where text prompts are ambiguous or context-dependent. To overcome these challenges, we present a novel framework VATEX to improve referring image segmentation by enhancing object and context understanding with Vision-Aware Text Feature. Our method involves using CLIP to derive a CLIP Prior that integrates an object-centric visual heatmap with text description, which can be used as the initial query in DETR-based architecture for the segmentation task. Furthermore, by observing that there are multiple ways to describe an instance in an image, we enforce feature similarity between text variations referring to the same visual input by two components: a novel Contextual Multimodal Decoder that turns text embeddings into vision-aware text features, and a Meaning Consistency Constraint to ensure further the coherent and consistent interpretation of language expressions with the context understanding obtained from the image. Our method achieves a significant performance improvement on three benchmark datasets RefCOCO, RefCOCO+ and G-Ref. Code is available at: this https URL \_RIS.
- [1260] arXiv:2404.08611 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Automatic Quantification of Serial PET/CT Images for Pediatric Hodgkin Lymphoma Patients Using a Longitudinally-Aware Segmentation NetworkXin Tie , Muheon Shin , Changhee Lee , Scott B. Perlman , Zachary Huemann , Amy J. Weisman , Sharon M. Castellino , Kara M. Kelly , Kathleen M. McCarten , Adina L. Alazraki , Junjie Hu , Steve Y. Cho , Tyler J. BradshawComments: 6 figures, 4 tables in the main textSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
Abstract: $\textbf{Purpose}$: Automatic quantification of longitudinal changes in PET scans for lymphoma patients has proven challenging, as residual disease in interim-therapy scans is often subtle and difficult to detect. Our goal was to develop a longitudinally-aware segmentation network (LAS-Net) that can quantify serial PET/CT images for pediatric Hodgkin lymphoma patients. $\textbf{Materials and Methods}$: This retrospective study included baseline (PET1) and interim (PET2) PET/CT images from 297 patients enrolled in two Children's Oncology Group clinical trials (AHOD1331 and AHOD0831). LAS-Net incorporates longitudinal cross-attention, allowing relevant features from PET1 to inform the analysis of PET2. Model performance was evaluated using Dice coefficients for PET1 and detection F1 scores for PET2. Additionally, we extracted and compared quantitative PET metrics, including metabolic tumor volume (MTV) and total lesion glycolysis (TLG) in PET1, as well as qPET and $\Delta$SUVmax in PET2, against physician measurements. We quantified their agreement using Spearman's $\rho$ correlations and employed bootstrap resampling for statistical analysis. $\textbf{Results}$: LAS-Net detected residual lymphoma in PET2 with an F1 score of 0.606 (precision/recall: 0.615/0.600), outperforming all comparator methods (P<0.01). For baseline segmentation, LAS-Net achieved a mean Dice score of 0.772. In PET quantification, LAS-Net's measurements of qPET, $\Delta$SUVmax, MTV and TLG were strongly correlated with physician measurements, with Spearman's $\rho$ of 0.78, 0.80, 0.93 and 0.96, respectively. The performance remained high, with a slight decrease, in an external testing cohort. $\textbf{Conclusion}$: LAS-Net achieved high performance in quantifying PET metrics across serial scans, highlighting the value of longitudinal awareness in evaluating multi-time-point imaging datasets.
- [1261] arXiv:2404.08627 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Is ChatGPT Transforming Academics' Writing Style?Comments: 15 pages, 19 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Abstract: Based on one million arXiv papers submitted from May 2018 to January 2024, we assess the textual density of ChatGPT's writing style in their abstracts by means of a statistical analysis of word frequency changes. Our model is calibrated and validated on a mixture of real abstracts and ChatGPT-modified abstracts (simulated data) after a careful noise analysis. We find that ChatGPT is having an increasing impact on arXiv abstracts, especially in the field of computer science, where the fraction of ChatGPT-revised abstracts is estimated to be approximately 35%, if we take the output of one of the simplest prompts, "revise the following sentences", as a baseline. We conclude with an analysis of both positive and negative aspects of the penetration of ChatGPT into academics' writing style.
- [1262] arXiv:2404.08630 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: A Conceptual Framework for Conversational Search and Recommendation: Conceptualizing Agent-Human Interactions During the Conversational Search ProcessJournal-ref: The Second International Workshop on Conversational Approaches to Information Retrieval (CAIR 2018) at ACM SIGIRSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: The conversational search task aims to enable a user to resolve information needs via natural language dialogue with an agent. In this paper, we aim to develop a conceptual framework of the actions and intents of users and agents explaining how these actions enable the user to explore the search space and resolve their information need. We outline the different actions and intents, before discussing key decision points in the conversation where the agent needs to decide how to steer the conversational search process to a successful and/or satisfactory conclusion. Essentially, this paper provides a conceptualization of the conversational search process between an agent and user, which provides a framework and a starting point for research, development and evaluation of conversational search agents.
- [1263] arXiv:2404.08634 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Pre-training Small Base LMs with Fewer TokensComments: 15 pages, 6 figures, 10 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We study the effectiveness of a simple approach to develop a small base language model (LM) starting from an existing large base LM: first inherit a few transformer blocks from the larger LM, and then train this smaller model on a very small subset (0.1\%) of the raw pretraining data of the larger model. We call our simple recipe Inheritune and first demonstrate it for building a small base LM with 1.5B parameters using 1B tokens (and a starting few layers of larger LM of 3B parameters); we do this using a single A6000 GPU for less than half a day. Across 9 diverse evaluation datasets as well as the MMLU benchmark, the resulting model compares favorably to publicly available base models of 1B-2B size, some of which have been trained using 50-1000 times more tokens.
We investigate Inheritune in a slightly different setting where we train small LMs utilizing larger LMs and their full pre-training dataset. Here we show that smaller LMs trained utilizing some of the layers of GPT2-medium (355M) and GPT-2-large (770M) can effectively match the val loss of their bigger counterparts when trained from scratch for the same number of training steps on OpenWebText dataset with 9B tokens. We analyze our recipe with extensive experiments and demonstrate it efficacy on diverse settings. Our code is available at this https URL . - [1264] arXiv:2404.08652 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: Algorithm for AGC index management against crowded radio environmentComments: 17 pages, 15 figuresSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper describes a receiver that uses an innovative method to predict, according to history of receiver operating metrics (packet lost/well received), the optimum automatic gain control (AGC) index or most appropriate variable gain range to be used for next packet reception, anticipating an interferer appearing during the payload reception. This allows the receiver to have higher immunity to interferers even if they occur during the gain frozen payload reception period whilst still ensuring an optimum sensitivity level. As a result, the method allows setting the receiver gain to get an optimum trade-off between reception sensitivity and random interferer immunity.
- [1265] arXiv:2404.08654 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Optimal path for Biomedical Text Summarization Using Pointer GPTComments: 3 pages, 3 figuresJournal-ref: KSC2023Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Biomedical text summarization is a critical tool that enables clinicians to effectively ascertain patient status. Traditionally, text summarization has been accomplished with transformer models, which are capable of compressing long documents into brief summaries. However, transformer models are known to be among the most challenging natural language processing (NLP) tasks. Specifically, GPT models have a tendency to generate factual errors, lack context, and oversimplify words. To address these limitations, we replaced the attention mechanism in the GPT model with a pointer network. This modification was designed to preserve the core values of the original text during the summarization process. The effectiveness of the Pointer-GPT model was evaluated using the ROUGE score. The results demonstrated that Pointer-GPT outperformed the original GPT model. These findings suggest that pointer networks can be a valuable addition to EMR systems and can provide clinicians with more accurate and informative summaries of patient medical records. This research has the potential to usher in a new paradigm in EMR systems and to revolutionize the way that clinicians interact with patient medical records.
- [1266] arXiv:2404.08655 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Transformer-based Joint Modelling for Automatic Essay Scoring and Off-Topic DetectionComments: Accepted in LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Automated Essay Scoring (AES) systems are widely popular in the market as they constitute a cost-effective and time-effective option for grading systems. Nevertheless, many studies have demonstrated that the AES system fails to assign lower grades to irrelevant responses. Thus, detecting the off-topic response in automated essay scoring is crucial in practical tasks where candidates write unrelated text responses to the given task in the question. In this paper, we are proposing an unsupervised technique that jointly scores essays and detects off-topic essays. The proposed Automated Open Essay Scoring (AOES) model uses a novel topic regularization module (TRM), which can be attached on top of a transformer model, and is trained using a proposed hybrid loss function. After training, the AOES model is further used to calculate the Mahalanobis distance score for off-topic essay detection. Our proposed method outperforms the baseline we created and earlier conventional methods on two essay-scoring datasets in off-topic detection as well as on-topic scoring. Experimental evaluation results on different adversarial strategies also show how the suggested method is robust for detecting possible human-level perturbations.
- [1267] arXiv:2404.08656 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Linear Cross-document Event Coreference Resolution with X-AMRShafiuddin Rehan Ahmed , George Arthur Baker , Evi Judge , Michael Regan , Kristin Wright-Bettner , Martha Palmer , James H. MartinComments: LREC-COLING 2024 main conferenceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Event Coreference Resolution (ECR) as a pairwise mention classification task is expensive both for automated systems and manual annotations. The task's quadratic difficulty is exacerbated when using Large Language Models (LLMs), making prompt engineering for ECR prohibitively costly. In this work, we propose a graphical representation of events, X-AMR, anchored around individual mentions using a \textbf{cross}-document version of \textbf{A}bstract \textbf{M}eaning \textbf{R}epresentation. We then linearize the ECR with a novel multi-hop coreference algorithm over the event graphs. The event graphs simplify ECR, making it a) LLM cost-effective, b) compositional and interpretable, and c) easily annotated. For a fair assessment, we first enrich an existing ECR benchmark dataset with these event graphs using an annotator-friendly tool we introduce. Then, we employ GPT-4, the newest LLM by OpenAI, for these annotations. Finally, using the ECR algorithm, we assess GPT-4 against humans and analyze its limitations. Through this research, we aim to advance the state-of-the-art for efficient ECR and shed light on the potential shortcomings of current LLMs at this task. Code and annotations: \url{ this https URL }
- [1268] arXiv:2404.08664 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Identifying Banking Transaction Descriptions via Support Vector Machine Short-Text Classification Based on a Specialized Labelled CorpusSilvia García-Méndez , Milagros Fernández-Gavilanes , Jonathan Juncal-Martínez , Francisco J. González-Castaño , Oscar Barba SearaSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Short texts are omnipresent in real-time news, social network commentaries, etc. Traditional text representation methods have been successfully applied to self-contained documents of medium size. However, information in short texts is often insufficient, due, for example, to the use of mnemonics, which makes them hard to classify. Therefore, the particularities of specific domains must be exploited. In this article we describe a novel system that combines Natural Language Processing techniques with Machine Learning algorithms to classify banking transaction descriptions for personal finance management, a problem that was not previously considered in the literature. We trained and tested that system on a labelled dataset with real customer transactions that will be available to other researchers on request. Motivated by existing solutions in spam detection, we also propose a short text similarity detector to reduce training set size based on the Jaccard distance. Experimental results with a two-stage classifier combining this detector with a SVM indicate a high accuracy in comparison with alternative approaches, taking into account complexity and computing time. Finally, we present a use case with a personal finance application, CoinScrap, which is available at Google Play and App Store.
- [1269] arXiv:2404.08668 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: A Comprehensive Survey on AI-based Methods for PatentsSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in Artificial Intelligence (AI) and machine learning have demonstrated transformative capabilities across diverse domains. This progress extends to the field of patent analysis and innovation, where AI-based tools present opportunities to streamline and enhance important tasks in the patent cycle such as classification, retrieval, and valuation prediction. This not only accelerates the efficiency of patent researchers and applicants but also opens new avenues for technological innovation and discovery. Our survey provides a comprehensive summary of recent AI tools in patent analysis from more than 40 papers from 26 venues between 2017 and 2023. Unlike existing surveys, we include methods that work for patent image and text data. Furthermore, we introduce a novel taxonomy for the categorization based on the tasks in the patent life cycle as well as the specifics of the AI methods. This survey aims to serve as a resource for researchers, practitioners, and patent offices in the domain of AI-powered patent analysis.
- [1270] arXiv:2404.08672 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Taxonomy and Analysis of Sensitive User Queries in Generative AI SearchHwiyeol Jo , Taiwoo Park , Nayoung Choi , Changbong Kim , Ohjoon Kwon , Donghyeon Jeon , Hyunwoo Lee , Eui-Hyeon Lee , Kyoungho Shin , Sun Suk Lim , Kyungmi Kim , Jihye Lee , Sun KimSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Although there has been a growing interest among industries to integrate generative LLMs into their services, limited experiences and scarcity of resources acts as a barrier in launching and servicing large-scale LLM-based conversational services. In this paper, we share our experiences in developing and operating generative AI models within a national-scale search engine, with a specific focus on the sensitiveness of user queries. We propose a taxonomy for sensitive search queries, outline our approaches, and present a comprehensive analysis report on sensitive queries from actual users.
- [1271] arXiv:2404.08673 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Sentiment analysis and random forest to classify LLM versus human source applied to Scientific TextsComments: 12 Pages, 3 tables, 6 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: After the launch of ChatGPT v.4 there has been a global vivid discussion on the ability of this artificial intelligence powered platform and some other similar ones for the automatic production of all kinds of texts, including scientific and technical texts. This has triggered a reflection in many institutions on whether education and academic procedures should be adapted to the fact that in future many texts we read will not be written by humans (students, scholars, etc.), at least, not entirely. In this work it is proposed a new methodology to classify texts coming from an automatic text production engine or a human, based on Sentiment Analysis as a source for feature engineering independent variables and then train with them a Random Forest classification algorithm. Using four different sentiment lexicons, a number of new features where produced, and then fed to a machine learning random forest methodology, to train such a model. Results seem very convincing that this may be a promising research line to detect fraud, in such environments where human are supposed to be the source of texts.
- [1272] arXiv:2404.08674 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Effects of Different Prompts on the Quality of GPT-4 Responses to Dementia Care QuestionsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Evidence suggests that different prompts lead large language models (LLMs) to generate responses with varying quality. Yet, little is known about prompts' effects on response quality in healthcare domains. In this exploratory study, we address this gap, focusing on a specific healthcare domain: dementia caregiving. We first developed an innovative prompt template with three components: (1) system prompts (SPs) featuring 4 different roles; (2) an initialization prompt; and (3) task prompts (TPs) specifying different levels of details, totaling 12 prompt combinations. Next, we selected 3 social media posts containing complicated, real-world questions about dementia caregivers' challenges in 3 areas: memory loss and confusion, aggression, and driving. We then entered these posts into GPT-4, with our 12 prompts, to generate 12 responses per post, totaling 36 responses. We compared the word count of the 36 responses to explore potential differences in response length. Two experienced dementia care clinicians on our team assessed the response quality using a rating scale with 5 quality indicators: factual, interpretation, application, synthesis, and comprehensiveness (scoring range: 0-5; higher scores indicate higher quality).
- [1273] arXiv:2404.08675 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: RecGPT: Generative Personalized Prompts for Sequential Recommendation via ChatGPT Training ParadigmSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: ChatGPT has achieved remarkable success in natural language understanding. Considering that recommendation is indeed a conversation between users and the system with items as words, which has similar underlying pattern with ChatGPT, we design a new chat framework in item index level for the recommendation task. Our novelty mainly contains three parts: model, training and inference. For the model part, we adopt Generative Pre-training Transformer (GPT) as the sequential recommendation model and design a user modular to capture personalized information. For the training part, we adopt the two-stage paradigm of ChatGPT, including pre-training and fine-tuning. In the pre-training stage, we train GPT model by auto-regression. In the fine-tuning stage, we train the model with prompts, which include both the newly-generated results from the model and the user's feedback. For the inference part, we predict several user interests as user representations in an autoregressive manner. For each interest vector, we recall several items with the highest similarity and merge the items recalled by all interest vectors into the final result. We conduct experiments with both offline public datasets and online A/B test to demonstrate the effectiveness of our proposed method.
- [1274] arXiv:2404.08677 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: PMG : Personalized Multimodal Generation with Large Language ModelsSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The emergence of large language models (LLMs) has revolutionized the capabilities of text comprehension and generation. Multi-modal generation attracts great attention from both the industry and academia, but there is little work on personalized generation, which has important applications such as recommender systems. This paper proposes the first method for personalized multimodal generation using LLMs, showcases its applications and validates its performance via an extensive experimental study on two datasets. The proposed method, Personalized Multimodal Generation (PMG for short) first converts user behaviors (e.g., clicks in recommender systems or conversations with a virtual assistant) into natural language to facilitate LLM understanding and extract user preference descriptions. Such user preferences are then fed into a generator, such as a multimodal LLM or diffusion model, to produce personalized content. To capture user preferences comprehensively and accurately, we propose to let the LLM output a combination of explicit keywords and implicit embeddings to represent user preferences. Then the combination of keywords and embeddings are used as prompts to condition the generator. We optimize a weighted sum of the accuracy and preference scores so that the generated content has a good balance between them. Compared to a baseline method without personalization, PMG has a significant improvement on personalization for up to 8% in terms of LPIPS while retaining the accuracy of generation.
- [1275] arXiv:2404.08679 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Your Finetuned Large Language Model is Already a Powerful Out-of-distribution DetectorSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract: We revisit the likelihood ratio between a pretrained large language model (LLM) and its finetuned variant as a criterion for out-of-distribution (OOD) detection. The intuition behind such a criterion is that, the pretrained LLM has the prior knowledge about OOD data due to its large amount of training data, and once finetuned with the in-distribution data, the LLM has sufficient knowledge to distinguish their difference. Leveraging the power of LLMs, we show that, for the first time, the likelihood ratio can serve as an effective OOD detector. Moreover, we apply the proposed LLM-based likelihood ratio to detect OOD questions in question-answering (QA) systems, which can be used to improve the performance of specialized LLMs for general questions. Given that likelihood can be easily obtained by the loss functions within contemporary neural network frameworks, it is straightforward to implement this approach in practice. Since both the pretrained LLMs and its various finetuned models are available, our proposed criterion can be effortlessly incorporated for OOD detection without the need for further training. We conduct comprehensive evaluation across on multiple settings, including far OOD, near OOD, spam detection, and QA scenarios, to demonstrate the effectiveness of the method.
- [1276] arXiv:2404.08684 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Is English the New Programming Language? How About Pseudo-code Engineering?Journal-ref: Acta Sci. (Canoas), 26(1), 157-204, Jan./Feb. 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Background: The integration of artificial intelligence (AI) into daily life, particularly through chatbots utilizing natural language processing (NLP), presents both revolutionary potential and unique challenges. This intended to investigate how different input forms impact ChatGPT, a leading language model by OpenAI, performance in understanding and executing complex, multi-intention tasks. Design: Employing a case study methodology supplemented by discourse analysis, the research analyzes ChatGPT's responses to inputs varying from natural language to pseudo-code engineering. The study specifically examines the model's proficiency across four categories: understanding of intentions, interpretability, completeness, and creativity. Setting and Participants: As a theoretical exploration of AI interaction, this study focuses on the analysis of structured and unstructured inputs processed by ChatGPT, without direct human participants. Data collection and analysis: The research utilizes synthetic case scenarios, including the organization of a "weekly meal plan" and a "shopping list," to assess ChatGPT's response to prompts in both natural language and pseudo-code engineering. The analysis is grounded in the identification of patterns, contradictions, and unique response elements across different input formats. Results: Findings reveal that pseudo-code engineering inputs significantly enhance the clarity and determinism of ChatGPT's responses, reducing ambiguity inherent in natural language. Enhanced natural language, structured through prompt engineering techniques, similarly improves the model's interpretability and creativity. Conclusions: The study underscores the potential of pseudo-code engineering in refining human-AI interaction and achieving more deterministic, concise, and direct outcomes, advocating for its broader application across disciplines requiring precise AI responses.
- [1277] arXiv:2404.08687 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: A Survey of Reasoning for Substitution Relationships: Definitions, Methods, and DirectionsSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Substitute relationships are fundamental to people's daily lives across various domains. This study aims to comprehend and predict substitute relationships among products in diverse fields, extensively analyzing the application of machine learning algorithms, natural language processing, and other technologies. By comparing model methodologies across different domains, such as defining substitutes, representing and learning substitute relationships, and substitute reasoning, this study offers a methodological foundation for delving deeper into substitute relationships. Through ongoing research and innovation, we can further refine the personalization and accuracy of substitute recommendation systems, thus advancing the development and application of this field.
- [1278] arXiv:2404.08690 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Building a Robust Toxicity PredictorComments: ACL 2023 /Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract: Recent NLP literature pays little attention to the robustness of toxicity language predictors, while these systems are most likely to be used in adversarial contexts. This paper presents a novel adversarial attack, \texttt{ToxicTrap}, introducing small word-level perturbations to fool SOTA text classifiers to predict toxic text samples as benign. ToxicTrap exploits greedy based search strategies to enable fast and effective generation of toxic adversarial examples. Two novel goal function designs allow ToxicTrap to identify weaknesses in both multiclass and multilabel toxic language detectors. Our empirical results show that SOTA toxicity text classifiers are indeed vulnerable to the proposed attacks, attaining over 98\% attack success rates in multilabel cases. We also show how a vanilla adversarial training and its improved version can help increase robustness of a toxicity detector even against unseen attacks.
- [1279] arXiv:2404.08692 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Apollonion: Profile-centric Dialog AgentSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The emergence of Large Language Models (LLMs) has innovated the development of dialog agents. Specially, a well-trained LLM, as a central process unit, is capable of providing fluent and reasonable response for user's request. Besides, auxiliary tools such as external knowledge retrieval, personalized character for vivid response, short/long-term memory for ultra long context management are developed, completing the usage experience for LLM-based dialog agents. However, the above-mentioned techniques does not solve the issue of \textbf{personalization from user perspective}: agents response in a same fashion to different users, without consideration of their features, such as habits, interests and past experience. In another words, current implementation of dialog agents fail in ``knowing the user''. The capacity of well-description and representation of user is under development. In this work, we proposed a framework for dialog agent to incorporate user profiling (initialization, update): user's query and response is analyzed and organized into a structural user profile, which is latter served to provide personal and more precise response. Besides, we proposed a series of evaluation protocols for personalization: to what extend the response is personal to the different users.
The framework is named as \method{}, inspired by inscription of ``Know Yourself'' in the temple of Apollo (also known as \method{}) in Ancient Greek. Few works have been conducted on incorporating personalization into LLM, \method{} is a pioneer work on guiding LLM's response to meet individuation via the application of dialog agents, with a set of evaluation methods for measurement in personalization. - [1280] arXiv:2404.08695 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Enhancing Question Answering for Enterprise Knowledge Bases using Large Language ModelsComments: DASFAA 2024 AcceptedSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Efficient knowledge management plays a pivotal role in augmenting both the operational efficiency and the innovative capacity of businesses and organizations. By indexing knowledge through vectorization, a variety of knowledge retrieval methods have emerged, significantly enhancing the efficacy of knowledge management systems. Recently, the rapid advancements in generative natural language processing technologies paved the way for generating precise and coherent answers after retrieving relevant documents tailored to user queries. However, for enterprise knowledge bases, assembling extensive training data from scratch for knowledge retrieval and generation is a formidable challenge due to the privacy and security policies of private data, frequently entailing substantial costs. To address the challenge above, in this paper, we propose EKRG, a novel Retrieval-Generation framework based on large language models (LLMs), expertly designed to enable question-answering for Enterprise Knowledge bases with limited annotation costs. Specifically, for the retrieval process, we first introduce an instruction-tuning method using an LLM to generate sufficient document-question pairs for training a knowledge retriever. This method, through carefully designed instructions, efficiently generates diverse questions for enterprise knowledge bases, encompassing both fact-oriented and solution-oriented knowledge. Additionally, we develop a relevance-aware teacher-student learning strategy to further enhance the efficiency of the training process. For the generation process, we propose a novel chain of thought (CoT) based fine-tuning method to empower the LLM-based generator to adeptly respond to user questions using retrieved documents. Finally, extensive experiments on real-world datasets have demonstrated the effectiveness of our proposed framework.
- [1281] arXiv:2404.08699 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Analyzing the Impact of Data Selection and Fine-Tuning on Economic and Political Biases in LLMsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In an era where language models are increasingly integrated into decision-making and communication, understanding the biases within Large Language Models (LLMs) becomes imperative, especially when these models are applied in the economic and political domains. This work investigates the impact of fine-tuning and data selection on economic and political biases in LLM. We explore the methodological aspects of biasing LLMs towards specific ideologies, mindful of the biases that arise from their extensive training on diverse datasets. Our approach, distinct from earlier efforts that either focus on smaller models or entail resource-intensive pre-training, employs Parameter-Efficient Fine-Tuning (PEFT) techniques. These techniques allow for the alignment of LLMs with targeted ideologies by modifying a small subset of parameters. We introduce a systematic method for dataset selection, annotation, and instruction tuning, and we assess its effectiveness through both quantitative and qualitative evaluations. Our work analyzes the potential of embedding specific biases into LLMs and contributes to the dialogue on the ethical application of AI, highlighting the importance of deploying AI in a manner that aligns with societal values.
- [1282] arXiv:2404.08700 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Is Your LLM Outdated? Benchmarking LLMs & Alignment Algorithms for Time-Sensitive KnowledgeSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We study the appropriateness of Large Language Models (LLMs) as knowledge repositories. We focus on the challenge of maintaining LLMs' factual knowledge up-to-date over time. Motivated by the lack of studies on identifying outdated knowledge within LLMs, we design and develop a dynamic benchmark with up-to-date ground truth answers for each target factual question. We evaluate eighteen open-source and closed-source state-of-the-art LLMs on time-sensitive knowledge retrieved in real-time from Wikidata. We select time-sensitive domain facts in politics, sports, and organizations, and estimate the recency of the information learned by the model during pre-training\fine-tuning. In the second contribution, we evaluate the effectiveness of knowledge editing methods for aligning LLMs with up-to-date factual knowledge and compare their performance with Retrieval Augmented Generation. The dynamic benchmark is designed to be used as-is to assess LLMs's up-to-dateness, as well as to be extended to other domains by sharing the code, the dataset, as well as evaluation and visualization scripts.
- [1283] arXiv:2404.08704 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT PromptingAvinash Anand , Janak Kapuriya , Apoorv Singh , Jay Saraf , Naman Lal , Astha Verma , Rushali Gupta , Rajiv ShahSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: While Large Language Models (LLMs) can achieve human-level performance in various tasks, they continue to face challenges when it comes to effectively tackling multi-step physics reasoning tasks. To identify the shortcomings of existing models and facilitate further research in this area, we curated a novel dataset, MM-PhyQA, which comprises well-constructed, high schoollevel multimodal physics problems. By evaluating the performance of contemporary LLMs that are publicly available, both with and without the incorporation of multimodal elements in these problems, we aim to shed light on their capabilities. For generating answers for questions consisting of multimodal input (in this case, images and text) we employed Zero-shot prediction using GPT-4 and utilized LLaVA (LLaVA and LLaVA-1.5), the latter of which were fine-tuned on our dataset. For evaluating the performance of LLMs consisting solely of textual input, we tested the performance of the base and fine-tuned versions of the Mistral-7B and LLaMA2-7b models. We also showcased the performance of the novel Multi-Image Chain-of-Thought (MI-CoT) Prompting technique, which when used to train LLaVA-1.5 13b yielded the best results when tested on our dataset, with superior scores in most metrics and the highest accuracy of 71.65% on the test set.
- [1284] arXiv:2404.08705 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Introducing L2M3, A Multilingual Medical Large Language Model to Advance Health Equity in Low-Resource RegionsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Addressing the imminent shortfall of 10 million health workers by 2030, predominantly in Low- and Middle-Income Countries (LMICs), this paper introduces an innovative approach that harnesses the power of Large Language Models (LLMs) integrated with machine translation models. This solution is engineered to meet the unique needs of Community Health Workers (CHWs), overcoming language barriers, cultural sensitivities, and the limited availability of medical dialog datasets. I have crafted a model that not only boasts superior translation capabilities but also undergoes rigorous fine-tuning on open-source datasets to ensure medical accuracy and is equipped with comprehensive safety features to counteract the risks of misinformation.
Featuring a modular design, this approach is specifically structured for swift adaptation across various linguistic and cultural contexts, utilizing open-source components to significantly reduce healthcare operational costs. This strategic innovation markedly improves the accessibility and quality of healthcare services by providing CHWs with contextually appropriate medical knowledge and diagnostic tools. This paper highlights the transformative impact of this context-aware LLM, underscoring its crucial role in addressing the global healthcare workforce deficit and propelling forward healthcare outcomes in LMICs. - [1285] arXiv:2404.08707 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Large Language Model Can Continue Evolving From MistakesComments: For catastrophic forgetting experiments are lacking and need REVISIONSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large Language Models (LLMs) demonstrate impressive performance in various downstream tasks. However, they may still generate incorrect responses in certain scenarios due to the knowledge deficiencies and the flawed pre-training data. Continual Learning (CL) is a commonly used method to address this issue. Traditional CL is task-oriented, using novel or factually accurate data to retrain LLMs from scratch. However, this method requires more task-related training data and incurs expensive training costs. To address this challenge, we propose the Continue Evolving from Mistakes (CEM) method, inspired by the 'summarize mistakes' learning skill, to achieve iterative refinement of LLMs. Specifically, the incorrect responses of LLMs indicate knowledge deficiencies related to the questions. Therefore, we collect corpora with these knowledge from multiple data sources and follow it up with iterative supplementary training for continuous, targeted knowledge updating and supplementation. Meanwhile, we developed two strategies to construct supplementary training sets to enhance the LLM's understanding of the corpus and prevent catastrophic forgetting. We conducted extensive experiments to validate the effectiveness of this CL method. In the best case, our method resulted in a 17.00\% improvement in the accuracy of the LLM.
- [1286] arXiv:2404.08708 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Multi-scale Topology Optimization using Neural NetworksSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: A long-standing challenge is designing multi-scale structures with good connectivity between cells while optimizing each cell to reach close to the theoretical performance limit. We propose a new method for direct multi-scale topology optimization using neural networks. Our approach focuses on inverse homogenization that seamlessly maintains compatibility across neighboring microstructure cells. Our approach consists of a topology neural network that optimizes the microstructure shape and distribution across the design domain as a continuous field. Each microstructure cell is optimized based on a specified elasticity tensor that also accommodates in-plane rotations. The neural network takes as input the local coordinates within a cell to represent the density distribution within a cell, as well as the global coordinates of each cell to design spatially varying microstructure cells. As such, our approach models an n-dimensional multi-scale optimization problem as a 2n-dimensional inverse homogenization problem using neural networks. During the inverse homogenization of each unit cell, we extend the boundary of each cell by scaling the input coordinates such that the boundaries of neighboring cells are combined. Inverse homogenization on the combined cell improves connectivity. We demonstrate our method through the design and optimization of graded multi-scale structures.
- [1287] arXiv:2404.08710 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: Do Large Language Models Learn Human-Like Strategic Preferences?Subjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI)
Abstract: We evaluate whether LLMs learn to make human-like preference judgements in strategic scenarios as compared with known empirical results. We show that Solar and Mistral exhibit stable value-based preference consistent with human in the prisoner's dilemma, including stake-size effect, and traveler's dilemma, including penalty-size effect. We establish a relationship between model size, value based preference, and superficiality. Finally, we find that models that tend to be less brittle were trained with sliding window attention. Additionally, we contribute a novel method for constructing preference relations from arbitrary LLMs and support for a hypothesis regarding human behavior in the traveler's dilemma.
- [1288] arXiv:2404.08721 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Beyond One-Size-Fits-All: Adapting Counterfactual Explanations to User ObjectivesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Explainable Artificial Intelligence (XAI) has emerged as a critical area of research aimed at enhancing the transparency and interpretability of AI systems. Counterfactual Explanations (CFEs) offer valuable insights into the decision-making processes of machine learning algorithms by exploring alternative scenarios where certain factors differ. Despite the growing popularity of CFEs in the XAI community, existing literature often overlooks the diverse needs and objectives of users across different applications and domains, leading to a lack of tailored explanations that adequately address the different use cases. In this paper, we advocate for a nuanced understanding of CFEs, recognizing the variability in desired properties based on user objectives and target applications. We identify three primary user objectives and explore the desired characteristics of CFEs in each case. By addressing these differences, we aim to design more effective and tailored explanations that meet the specific needs of users, thereby enhancing collaboration with AI systems.
- [1289] arXiv:2404.08726 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: An Integrated Toolbox for Creating Neuromorphic Edge ApplicationsLars Niedermeier (1), Jeffrey L. Krichmar (2) ((1) Niedermeier Consulting, Zurich, ZH, Switzerland, (2) Department of Cognitive Sciences, Department of Computer Science, University of California, Irvine, CA, USA)Comments: 8 pages, 5 figures, NICE 2024Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Spiking Neural Networks (SNNs) and neuromorphic models are more efficient and have more biological realism than the activation functions typically used in deep neural networks, transformer models and generative AI. SNNs have local learning rules, are able to learn on small data sets, and can adapt through neuromodulation. Although research has shown their advantages, there are still few compelling practical applications, especially at the edge where sensors and actuators need to be processed in a timely fashion. One reason for this might be that SNNs are much more challenging to understand, build, and operate due to their intrinsic properties. For instance, the mathematical foundation involves differential equations rather than basic activation functions. To address these challenges, we have developed CARLsim++. It is an integrated toolbox that enables fast and easy creation of neuromorphic applications. It encapsulates the mathematical intrinsics and low-level C++ programming by providing a graphical user interface for users who do not have a background in software engineering but still want to create neuromorphic models. Developers can easily configure inputs and outputs to devices and robots. These can be accurately simulated before deploying on physical devices. CARLsim++ can lead to rapid development of neuromorphic applications for simulation or edge processing.
- [1290] arXiv:2404.08727 (cross-list from cs.DB) [ pdf , ps , html , other ]
-
Title: Can LLMs substitute SQL? Comparing Resource Utilization of Querying LLMs versus Traditional Relational DatabasesComments: 13 pages, 2 figures, 5 tablesSubjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large Language Models (LLMs) can automate or substitute different types of tasks in the software engineering process. This study evaluates the resource utilization and accuracy of LLM in interpreting and executing natural language queries against traditional SQL within relational database management systems. We empirically examine the resource utilization and accuracy of nine LLMs varying from 7 to 34 Billion parameters, including Llama2 7B, Llama2 13B, Mistral, Mixtral, Optimus-7B, SUS-chat-34B, platypus-yi-34b, NeuralHermes-2.5-Mistral-7B and Starling-LM-7B-alpha, using a small transaction dataset. Our findings indicate that using LLMs for database queries incurs significant energy overhead (even small and quantized models), making it an environmentally unfriendly approach. Therefore, we advise against replacing relational databases with LLMs due to their substantial resource utilization.
- [1291] arXiv:2404.08747 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: Observation-specific explanations through scattered data approximationSubjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Abstract: This work introduces the definition of observation-specific explanations to assign a score to each data point proportional to its importance in the definition of the prediction process. Such explanations involve the identification of the most influential observations for the black-box model of interest. The proposed method involves estimating these explanations by constructing a surrogate model through scattered data approximation utilizing the orthogonal matching pursuit algorithm. The proposed approach is validated on both simulated and real-world datasets.
- [1292] arXiv:2404.08755 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Training a Vision Language Model as Smartphone AssistantComments: ICLR 2024 workshop on Generative Models for Decision MakingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Abstract: Addressing the challenge of a digital assistant capable of executing a wide array of user tasks, our research focuses on the realm of instruction-based mobile device control. We leverage recent advancements in large language models (LLMs) and present a visual language model (VLM) that can fulfill diverse tasks on mobile devices. Our model functions by interacting solely with the user interface (UI). It uses the visual input from the device screen and mimics human-like interactions, encompassing gestures such as tapping and swiping. This generality in the input and output space allows our agent to interact with any application on the device. Unlike previous methods, our model operates not only on a single screen image but on vision-language sentences created from sequences of past screenshots along with corresponding actions. Evaluating our method on the challenging Android in the Wild benchmark demonstrates its promising efficacy and potential.
- [1293] arXiv:2404.08760 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: The Generation Gap:Exploring Age Bias in Large Language ModelsComments: 4 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we explore the alignment of values in Large Language Models (LLMs) with specific age groups, leveraging data from the World Value Survey across thirteen categories. Through a diverse set of prompts tailored to ensure response robustness, we find a general inclination of LLM values towards younger demographics. Additionally, we explore the impact of incorporating age identity information in prompts and observe challenges in mitigating value discrepancies with different age cohorts. Our findings highlight the age bias in LLMs and provide insights for future work.
- [1294] arXiv:2404.08786 (cross-list from cs.NE) [ pdf , ps , other ]
-
Title: NeuroLGP-SM: Scalable Surrogate-Assisted Neuroevolution for Deep Neural NetworksComments: Accept for IEEE Congress on Evolutionary Computation (CEC) (CEC 2024), Yokohama, Japan, 8 pages, 5 figures, 2 tables (Fixed typo x' -> x*)Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Evolutionary Algorithms (EAs) play a crucial role in the architectural configuration and training of Artificial Deep Neural Networks (DNNs), a process known as neuroevolution. However, neuroevolution is hindered by its inherent computational expense, requiring multiple generations, a large population, and numerous epochs. The most computationally intensive aspect lies in evaluating the fitness function of a single candidate solution. To address this challenge, we employ Surrogate-assisted EAs (SAEAs). While a few SAEAs approaches have been proposed in neuroevolution, none have been applied to truly large DNNs due to issues like intractable information usage. In this work, drawing inspiration from Genetic Programming semantics, we use phenotypic distance vectors, outputted from DNNs, alongside Kriging Partial Least Squares (KPLS), an approach that is effective in handling these large vectors, making them suitable for search. Our proposed approach, named Neuro-Linear Genetic Programming surrogate model (NeuroLGP-SM), efficiently and accurately estimates DNN fitness without the need for complete evaluations. NeuroLGP-SM demonstrates competitive or superior results compared to 12 other methods, including NeuroLGP without SM, convolutional neural networks, support vector machines, and autoencoders. Additionally, it is worth noting that NeuroLGP-SM is 25% more energy-efficient than its NeuroLGP counterpart. This efficiency advantage adds to the overall appeal of our proposed NeuroLGP-SM in optimising the configuration of large DNNs.
- [1295] arXiv:2404.08789 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Differentiable and Stable Long-Range Tracking of Multiple Posterior ModesComments: Neurips 2023Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Particle filters flexibly represent multiple posterior modes nonparametrically, via a collection of weighted samples, but have classically been applied to tracking problems with known dynamics and observation likelihoods. Such generative models may be inaccurate or unavailable for high-dimensional observations like images. We instead leverage training data to discriminatively learn particle-based representations of uncertainty in latent object states, conditioned on arbitrary observations via deep neural network encoders. While prior discriminative particle filters have used heuristic relaxations of discrete particle resampling, or biased learning by truncating gradients at resampling steps, we achieve unbiased and low-variance gradient estimates by representing posteriors as continuous mixture densities. Our theory and experiments expose dramatic failures of existing reparameterization-based estimators for mixture gradients, an issue we address via an importance-sampling gradient estimator. Unlike standard recurrent neural networks, our mixture density particle filter represents multimodal uncertainty in continuous latent states, improving accuracy and robustness. On a range of challenging tracking and robot localization problems, our approach achieves dramatic improvements in accuracy, while also showing much greater stability across multiple training runs.
- [1296] arXiv:2404.08799 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Semantic Approach to Quantifying the Consistency of Diffusion Model Image GenerationComments: Accepted to 2024 CVPR 3rd Explainable AI for Computer Vision (XAI4CV) WorkshopSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: In this study, we identify the need for an interpretable, quantitative score of the repeatability, or consistency, of image generation in diffusion models. We propose a semantic approach, using a pairwise mean CLIP (Contrastive Language-Image Pretraining) score as our semantic consistency score. We applied this metric to compare two state-of-the-art open-source image generation diffusion models, Stable Diffusion XL and PixArt-{\alpha}, and we found statistically significant differences between the semantic consistency scores for the models. Agreement between the Semantic Consistency Score selected model and aggregated human annotations was 94%. We also explored the consistency of SDXL and a LoRA-fine-tuned version of SDXL and found that the fine-tuned model had significantly higher semantic consistency in generated images. The Semantic Consistency Score proposed here offers a measure of image generation alignment, facilitating the evaluation of model architectures for specific tasks and aiding in informed decision-making regarding model selection.
- [1297] arXiv:2404.08811 (cross-list from cs.ET) [ pdf , ps , html , other ]
-
Title: Reducing the Barriers to Entry for Foundation Model TrainingSubjects: Emerging Technologies (cs.ET) ; Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Abstract: The world has recently witnessed an unprecedented acceleration in demands for Machine Learning and Artificial Intelligence applications. This spike in demand has imposed tremendous strain on the underlying technology stack in supply chain, GPU-accelerated hardware, software, datacenter power density, and energy consumption. If left on the current technological trajectory, future demands show insurmountable spending trends, further limiting market players, stifling innovation, and widening the technology gap. To address these challenges, we propose a fundamental change in the AI training infrastructure throughout the technology ecosystem. The changes require advancements in supercomputing and novel AI training approaches, from high-end software to low-level hardware, microprocessor, and chip design, while advancing the energy efficiency required by a sustainable infrastructure. This paper presents the analytical framework that quantitatively highlights the challenges and points to the opportunities to reduce the barriers to entry for training large language models.
- [1298] arXiv:2404.08814 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: E3: Ensemble of Expert Embedders for Adapting Synthetic Image Detectors to New Generators Using Limited DataComments: 11 pages, 4 figures, To be published in CVPRWMF24Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: As generative AI progresses rapidly, new synthetic image generators continue to emerge at a swift pace. Traditional detection methods face two main challenges in adapting to these generators: the forensic traces of synthetic images from new techniques can vastly differ from those learned during training, and access to data for these new generators is often limited. To address these issues, we introduce the Ensemble of Expert Embedders (E3), a novel continual learning framework for updating synthetic image detectors. E3 enables the accurate detection of images from newly emerged generators using minimal training data. Our approach does this by first employing transfer learning to develop a suite of expert embedders, each specializing in the forensic traces of a specific generator. Then, all embeddings are jointly analyzed by an Expert Knowledge Fusion Network to produce accurate and reliable detection decisions. Our experiments demonstrate that E3 outperforms existing continual learning methods, including those developed specifically for synthetic image detection.
- [1299] arXiv:2404.08825 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Inverse Kinematics for Neuro-Robotic Grasping with Humanoid Embodied AgentsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces a novel zero-shot motion planning method that allows users to quickly design smooth robot motions in Cartesian space. A Bézier curve-based Cartesian plan is transformed into a joint space trajectory by our neuro-inspired inverse kinematics (IK) method CycleIK, for which we enable platform independence by scaling it to arbitrary robot designs. The motion planner is evaluated on the physical hardware of the two humanoid robots NICO and NICOL in a human-in-the-loop grasping scenario. Our method is deployed with an embodied agent that is a large language model (LLM) at its core. We generalize the embodied agent, that was introduced for NICOL, to also be embodied by NICO. The agent can execute a discrete set of physical actions and allows the user to verbally instruct various different robots. We contribute a grasping primitive to its action space that allows for precise manipulation of household objects. The new CycleIK method is compared to popular numerical IK solvers and state-of-the-art neural IK methods in simulation and is shown to be competitive with or outperform all evaluated methods when the algorithm runtime is very short. The grasping primitive is evaluated on both NICOL and NICO robots with a reported grasp success of 72% to 82% for each robot, respectively.
- [1300] arXiv:2404.08828 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Hindsight PRIORs for Reward Learning from Human PreferencesComments: International Conference on Learning Representations, 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Preference based Reinforcement Learning (PbRL) removes the need to hand specify a reward function by learning a reward from preference feedback over policy behaviors. Current approaches to PbRL do not address the credit assignment problem inherent in determining which parts of a behavior most contributed to a preference, which result in data intensive approaches and subpar reward functions. We address such limitations by introducing a credit assignment strategy (Hindsight PRIOR) that uses a world model to approximate state importance within a trajectory and then guides rewards to be proportional to state importance through an auxiliary predicted return redistribution objective. Incorporating state importance into reward learning improves the speed of policy learning, overall policy performance, and reward recovery on both locomotion and manipulation tasks. For example, Hindsight PRIOR recovers on average significantly (p<0.05) more reward on MetaWorld (20%) and DMC (15%). The performance gains and our ablations demonstrate the benefits even a simple credit assignment strategy can have on reward learning and that state importance in forward dynamics prediction is a strong proxy for a state's contribution to a preference decision. Code repository can be found at this https URL .
- [1301] arXiv:2404.08836 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: BERT-LSH: Reducing Absolute Compute For AttentionComments: 10 pages, 5 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This study introduces a novel BERT-LSH model that incorporates Locality Sensitive Hashing (LSH) to approximate the attention mechanism in the BERT architecture. We examine the computational efficiency and performance of this model compared to a standard baseline BERT model. Our findings reveal that BERT-LSH significantly reduces computational demand for the self-attention layer while unexpectedly outperforming the baseline model in pretraining and fine-tuning tasks. These results suggest that the LSH-based attention mechanism not only offers computational advantages but also may enhance the model's ability to generalize from its training data. For more information, visit our GitHub repository: this https URL
- [1302] arXiv:2404.08844 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Multi-fingered Robotic Hand Grasping in Cluttered Environments through Hand-object Contact Semantic MappingComments: 8 pagesSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: The integration of optimization method and generative models has significantly advanced dexterous manipulation techniques for five-fingered hand grasping. Yet, the application of these techniques in cluttered environments is a relatively unexplored area. To address this research gap, we have developed a novel method for generating five-fingered hand grasp samples in cluttered settings. This method emphasizes simulated grasp quality and the nuanced interaction between the hand and surrounding objects. A key aspect of our approach is our data generation method, capable of estimating contact spatial and semantic representations and affordance grasps based on object affordance information. Furthermore, our Contact Semantic Conditional Variational Autoencoder (CoSe-CVAE) network is adept at creating comprehensive contact maps from point clouds, incorporating both spatial and semantic data. We introduce a unique grasp detection technique that efficiently formulates mechanical hand grasp poses from these maps. Additionally, our evaluation model is designed to assess grasp quality and collision probability, significantly improving the practicality of five-fingered hand grasping in complex scenarios. Our data generation method outperforms previous datasets in grasp diversity, scene diversity, modality diversity. Our grasp generation method has demonstrated remarkable success, outperforming established baselines with 81.0% average success rate in real-world single-object grasping and 75.3% success rate in multi-object grasping. The dataset and supplementary materials can be found at this https URL , and we will release the code upon publication.
- [1303] arXiv:2404.08856 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: On Speculative Decoding for Multimodal Large Language ModelsComments: Accepted as a spotlight paper to ELVM workshop at CVPR 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Inference with Multimodal Large Language Models (MLLMs) is slow due to their large-language-model backbone which suffers from memory bandwidth bottleneck and generates tokens auto-regressively. In this paper, we explore the application of speculative decoding to enhance the inference efficiency of MLLMs, specifically the LLaVA 7B model. We show that a language-only model can serve as a good draft model for speculative decoding with LLaVA 7B, bypassing the need for image tokens and their associated processing components from the draft model. Our experiments across three different tasks show that speculative decoding can achieve a memory-bound speedup of up to 2.37$\times$ using a 115M parameter language model that we trained from scratch. Additionally, we introduce a compact LLaVA draft model incorporating an image adapter, which shows marginal performance gains in image captioning while maintaining comparable results in other tasks.
- [1304] arXiv:2404.08857 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Voice Attribute Editing with Text PromptSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Despite recent advancements in speech generation with text prompt providing control over speech style, voice attributes in synthesized speech remain elusive and challenging to control. This paper introduces a novel task: voice attribute editing with text prompt, with the goal of making relative modifications to voice attributes according to the actions described in the text prompt. To solve this task, VoxEditor, an end-to-end generative model, is proposed. In VoxEditor, addressing the insufficiency of text prompt, a Residual Memory (ResMem) block is designed, that efficiently maps voice attributes and these descriptors into the shared feature space. Additionally, the ResMem block is enhanced with a voice attribute degree prediction (VADP) block to align voice attributes with corresponding descriptors, addressing the imprecision of text prompt caused by non-quantitative descriptions of voice attributes. We also establish the open-source VCTK-RVA dataset, which leads the way in manual annotations detailing voice characteristic differences among different speakers. Extensive experiments demonstrate the effectiveness and generalizability of our proposed method in terms of both objective and subjective metrics. The dataset and audio samples are available on the website.
- [1305] arXiv:2404.08858 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A Lightweight Spatiotemporal Network for Online Eye Tracking with Event CameraComments: 8 pages, 3 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Event-based data are commonly encountered in edge computing environments where efficiency and low latency are critical. To interface with such data and leverage their rich temporal features, we propose a causal spatiotemporal convolutional network. This solution targets efficient implementation on edge-appropriate hardware with limited resources in three ways: 1) deliberately targets a simple architecture and set of operations (convolutions, ReLU activations) 2) can be configured to perform online inference efficiently via buffering of layer outputs 3) can achieve more than 90% activation sparsity through regularization during training, enabling very significant efficiency gains on event-based processors. In addition, we propose a general affine augmentation strategy acting directly on the events, which alleviates the problem of dataset scarcity for event-based systems. We apply our model on the AIS 2024 event-based eye tracking challenge, reaching a score of 0.9916 p10 accuracy on the Kaggle private testset.
- [1306] arXiv:2404.08866 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: An evaluation framework for synthetic data generation modelsComments: This paper has been accepted for presentation at IFIP International Conference on Artificial Intelligence Applications and InnovationsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Nowadays, the use of synthetic data has gained popularity as a cost-efficient strategy for enhancing data augmentation for improving machine learning models performance as well as addressing concerns related to sensitive data privacy. Therefore, the necessity of ensuring quality of generated synthetic data, in terms of accurate representation of real data, consists of primary importance. In this work, we present a new framework for evaluating synthetic data generation models' ability for developing high-quality synthetic data. The proposed approach is able to provide strong statistical and theoretical information about the evaluation framework and the compared models' ranking. Two use case scenarios demonstrate the applicability of the proposed framework for evaluating the ability of synthetic data generation models to generated high quality data. The implementation code can be found in this https URL .
- [1307] arXiv:2404.08872 (cross-list from cond-mat.mtrl-sci) [ pdf , ps , other ]
-
Title: Enhanced Hydrogen Evolution Activity of MOS$_2$-rGO Composite Synthesized via Hydrothermal TechniqueComments: This research is an excerpt from the report of TARP (Technical Answers for Real-World Problems)Subjects: Materials Science (cond-mat.mtrl-sci) ; Artificial Intelligence (cs.AI)
Abstract: Hydrogen evolution reaction (HER) has emerged as a promising technique for the production of clean and sustainable energy. In recent years, researchers have been exploring various materials for efficient HER activity. In this study, we report the synthesis of two different materials, namely MOS$_2$ and MoS$_2$-rGO, through a hydrothermal technique. X-ray diffraction (XRD), Fourier-transform infrared (FTIR) spectroscopy, and Raman spectroscopy were used to characterize the materials. XRD analysis revealed the formation of hexagonal MOS$_2$ with a high degree of crystallinity. FTIR analysis confirmed the presence of Mo-S bonds, while Raman spectroscopy provided evidence for the formation of MOS$_2$.To evaluate the HER activity of the materials, linear sweep voltammetry (LSV) was performed. The results showed that MOS$_2$ and MOS$_2$-rGO had good HER activity with low onset potentials and high current densities. The MOS$_2$-rGO material showed improved HER activity compared to MOS$_2$, indicating the potential of graphene oxide as a co-catalyst to enhance the performance of MOS$_2$.
- [1308] arXiv:2404.08886 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: EIVEN: Efficient Implicit Attribute Value Extraction using Multimodal LLMHenry Peng Zou , Gavin Heqing Yu , Ziwei Fan , Dan Bu , Han Liu , Peng Dai , Dongmei Jia , Cornelia CarageaComments: Accepted by NAACL 2024 Industry TrackSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: In e-commerce, accurately extracting product attribute values from multimodal data is crucial for improving user experience and operational efficiency of retailers. However, previous approaches to multimodal attribute value extraction often struggle with implicit attribute values embedded in images or text, rely heavily on extensive labeled data, and can easily confuse similar attribute values. To address these issues, we introduce EIVEN, a data- and parameter-efficient generative framework that pioneers the use of multimodal LLM for implicit attribute value extraction. EIVEN leverages the rich inherent knowledge of a pre-trained LLM and vision encoder to reduce reliance on labeled data. We also introduce a novel Learning-by-Comparison technique to reduce model confusion by enforcing attribute value comparison and difference identification. Additionally, we construct initial open-source datasets for multimodal implicit attribute value extraction. Our extensive experiments reveal that EIVEN significantly outperforms existing methods in extracting implicit attribute values while requiring less labeled data.
- [1309] arXiv:2404.08892 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ChangeAnywhere: Sample Generation for Remote Sensing Change Detection via Semantic Latent Diffusion ModelComments: Concise manuscript version of ChangeAnywhereSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Remote sensing change detection (CD) is a pivotal technique that pinpoints changes on a global scale based on multi-temporal images. With the recent expansion of deep learning, supervised deep learning-based CD models have shown satisfactory performance. However, CD sample labeling is very time-consuming as it is densely labeled and requires expert knowledge. To alleviate this problem, we introduce ChangeAnywhere, a novel CD sample generation method using the semantic latent diffusion model and single-temporal images. Specifically, ChangeAnywhere leverages the relative ease of acquiring large single-temporal semantic datasets to generate large-scale, diverse, and semantically annotated bi-temporal CD datasets. ChangeAnywhere captures the two essentials of CD samples, i.e., change implies semantically different, and non-change implies reasonable change under the same semantic constraints. We generated ChangeAnywhere-100K, the largest synthesis CD dataset with 100,000 pairs of CD samples based on the proposed method. The ChangeAnywhere-100K significantly improved both zero-shot and few-shot performance on two CD benchmark datasets for various deep learning-based CD models, as demonstrated by transfer experiments. This paper delineates the enormous potential of ChangeAnywhere for CD sample generation and demonstrates the subsequent enhancement of model performance. Therefore, ChangeAnywhere offers a potent tool for remote sensing CD. All codes and pre-trained models will be available at this https URL .
- [1310] arXiv:2404.08931 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Label-free Anomaly Detection in Aerial Agricultural Images with Masked Image ModelingComments: The paper has been accepted to CVPR 2024 5th Workshop on Vision for Agriculture as an Oral PaperSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Detecting various types of stresses (nutritional, water, nitrogen, etc.) in agricultural fields is critical for farmers to ensure maximum productivity. However, stresses show up in different shapes and sizes across different crop types and varieties. Hence, this is posed as an anomaly detection task in agricultural images. Accurate anomaly detection in agricultural UAV images is vital for early identification of field irregularities. Traditional supervised learning faces challenges in adapting to diverse anomalies, necessitating extensive annotated data. In this work, we overcome this limitation with self-supervised learning using a masked image modeling approach. Masked Autoencoders (MAE) extract meaningful normal features from unlabeled image samples which produces high reconstruction error for the abnormal pixels during reconstruction. To remove the need of using only ``normal" data while training, we use an anomaly suppression loss mechanism that effectively minimizes the reconstruction of anomalous pixels and allows the model to learn anomalous areas without explicitly separating ``normal" images for training. Evaluation on the Agriculture-Vision data challenge shows a mIOU score improvement in comparison to prior state of the art in unsupervised and self-supervised methods. A single model generalizes across all the anomaly categories in the Agri-Vision Challenge Dataset
- [1311] arXiv:2404.08937 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ChimpVLM: Ethogram-Enhanced Chimpanzee Behaviour RecognitionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We show that chimpanzee behaviour understanding from camera traps can be enhanced by providing visual architectures with access to an embedding of text descriptions that detail species behaviours. In particular, we present a vision-language model which employs multi-modal decoding of visual features extracted directly from camera trap videos to process query tokens representing behaviours and output class predictions. Query tokens are initialised using a standardised ethogram of chimpanzee behaviour, rather than using random or name-based initialisations. In addition, the effect of initialising query tokens using a masked language model fine-tuned on a text corpus of known behavioural patterns is explored. We evaluate our system on the PanAf500 and PanAf20K datasets and demonstrate the performance benefits of our multi-modal decoding approach and query initialisation strategy on multi-class and multi-label recognition tasks, respectively. Results and ablations corroborate performance improvements. We achieve state-of-the-art performance over vision and vision-language models in top-1 accuracy (+6.34%) on PanAf500 and overall (+1.1%) and tail-class (+2.26%) mean average precision on PanAf20K. We share complete source code and network weights for full reproducibility of results and easy utilisation.
- [1312] arXiv:2404.08939 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: NeurIT: Pushing the Limit of Neural Inertial Tracking for Indoor Robotic IoTSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Inertial tracking is vital for robotic IoT and has gained popularity thanks to the ubiquity of low-cost Inertial Measurement Units (IMUs) and deep learning-powered tracking algorithms. Existing works, however, have not fully utilized IMU measurements, particularly magnetometers, nor maximized the potential of deep learning to achieve the desired accuracy. To enhance the tracking accuracy for indoor robotic applications, we introduce NeurIT, a sequence-to-sequence framework that elevates tracking accuracy to a new level. NeurIT employs a Time-Frequency Block-recurrent Transformer (TF-BRT) at its core, combining the power of recurrent neural network (RNN) and Transformer to learn representative features in both time and frequency domains. To fully utilize IMU information, we strategically employ body-frame differentiation of the magnetometer, which considerably reduces the tracking error. NeurIT is implemented on a customized robotic platform and evaluated in various indoor environments. Experimental results demonstrate that NeurIT achieves a mere 1-meter tracking error over a 300-meter distance. Notably, it significantly outperforms state-of-the-art baselines by 48.21% on unseen data. NeurIT also performs comparably to the visual-inertial approach (Tango Phone) in vision-favored conditions and surpasses it in plain environments. We believe NeurIT takes an important step forward toward practical neural inertial tracking for ubiquitous and scalable tracking of robotic things. NeurIT, including the source code and the dataset, is open-sourced here: this https URL .
- [1313] arXiv:2404.08964 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Understanding Multimodal Deep Neural Networks: A Concept Selection ViewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The multimodal deep neural networks, represented by CLIP, have generated rich downstream applications owing to their excellent performance, thus making understanding the decision-making process of CLIP an essential research topic. Due to the complex structure and the massive pre-training data, it is often regarded as a black-box model that is too difficult to understand and interpret. Concept-based models map the black-box visual representations extracted by deep neural networks onto a set of human-understandable concepts and use the concepts to make predictions, enhancing the transparency of the decision-making process. However, these methods involve the datasets labeled with fine-grained attributes by expert knowledge, which incur high costs and introduce excessive human prior knowledge and bias. In this paper, we observe the long-tail distribution of concepts, based on which we propose a two-stage Concept Selection Model (CSM) to mine core concepts without introducing any human priors. The concept greedy rough selection algorithm is applied to extract head concepts, and then the concept mask fine selection method performs the extraction of core concepts. Experiments show that our approach achieves comparable performance to end-to-end black-box models, and human evaluation demonstrates that the concepts discovered by our method are interpretable and comprehensible for humans.
- [1314] arXiv:2404.08978 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Incremental Residual Concept Bottleneck ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Concept Bottleneck Models (CBMs) map the black-box visual representations extracted by deep neural networks onto a set of interpretable concepts and use the concepts to make predictions, enhancing the transparency of the decision-making process. Multimodal pre-trained models can match visual representations with textual concept embeddings, allowing for obtaining the interpretable concept bottleneck without the expertise concept annotations. Recent research has focused on the concept bank establishment and the high-quality concept selection. However, it is challenging to construct a comprehensive concept bank through humans or large language models, which severely limits the performance of CBMs. In this work, we propose the Incremental Residual Concept Bottleneck Model (Res-CBM) to address the challenge of concept completeness. Specifically, the residual concept bottleneck model employs a set of optimizable vectors to complete missing concepts, then the incremental concept discovery module converts the complemented vectors with unclear meanings into potential concepts in the candidate concept bank. Our approach can be applied to any user-defined concept bank, as a post-hoc processing method to enhance the performance of any CBMs. Furthermore, to measure the descriptive efficiency of CBMs, the Concept Utilization Efficiency (CUE) metric is proposed. Experiments show that the Res-CBM outperforms the current state-of-the-art methods in terms of both accuracy and efficiency and achieves comparable performance to black-box models across multiple datasets.
- [1315] arXiv:2404.08985 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient FinetuningComments: 13 pages, 5 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications, ranging from content generation to interactive entertainment, and artistic creation. However, the diversity of downstream tasks in multitask scenarios presents substantial adaptation challenges for LLMs. While traditional methods often succumb to knowledge confusion on their monolithic dense models, Mixture-of-Experts (MoE) has been emerged as a promising solution with its sparse architecture for effective task decoupling. Inspired by the principles of human cognitive neuroscience, we design a novel framework \texttt{Intuition-MoR1E} that leverages the inherent semantic clustering of instances to mimic the human brain to deal with multitask, offering implicit guidance to router for optimized feature allocation. Moreover, we introduce cutting-edge Rank-1 Experts formulation designed to manage a spectrum of intuitions, demonstrating enhanced parameter efficiency and effectiveness in multitask LLM finetuning. Extensive experiments demonstrate that Intuition-MoR1E achieves superior efficiency and 2.15\% overall accuracy improvement across 14 public datasets against other state-of-the-art baselines.
- [1316] arXiv:2404.08986 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Airship Formations for Animal Motion Capture and Behavior AnalysisComments: 7 pages, 3 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: Using UAVs for wildlife observation and motion capture offers manifold advantages for studying animals in the wild, especially grazing herds in open terrain. The aerial perspective allows observation at a scale and depth that is not possible on the ground, offering new insights into group behavior. However, the very nature of wildlife field-studies puts traditional fixed wing and multi-copter systems to their limits: limited flight time, noise and safety aspects affect their efficacy, where lighter than air systems can remain on station for many hours. Nevertheless, airships are challenging from a ground handling perspective as well as from a control point of view, being voluminous and highly affected by wind. In this work, we showcase a system designed to use airship formations to track, follow, and visually record wild horses from multiple angles, including airship design, simulation, control, on board computer vision, autonomous operation and practical aspects of field experiments.
- [1317] arXiv:2404.08990 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: A Fourier-enhanced multi-modal 3D small object optical mark recognition and positioning method for percutaneous abdominal puncture surgical navigationZezhao Guo (1), Yanzhong Guo (2), Zhanfang Zhao (1) ((1) College of information and Engineering, Hebei GEO University, (2) Beijing Yingrui Pioneer Medical Technology Co., Ltd)Comments: 19 pages, 6 figures,Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Navigation for thoracoabdominal puncture surgery is used to locate the needle entry point on the patient's body surface. The traditional reflective ball navigation method is difficult to position the needle entry point on the soft, irregular, smooth chest and abdomen. Due to the lack of clear characteristic points on the body surface using structured light technology, it is difficult to identify and locate arbitrary needle insertion points. Based on the high stability and high accuracy requirements of surgical navigation, this paper proposed a novel method, a muti-modal 3D small object medical marker detection method, which identifies the center of a small single ring as the needle insertion point. Moreover, this novel method leverages Fourier transform enhancement technology to augment the dataset, enrich image details, and enhance the network's capability. The method extracts the Region of Interest (ROI) of the feature image from both enhanced and original images, followed by generating a mask map. Subsequently, the point cloud of the ROI from the depth map is obtained through the registration of ROI point cloud contour fitting. In addition, this method employs Tukey loss for optimal precision. The experimental results show this novel method proposed in this paper not only achieves high-precision and high-stability positioning, but also enables the positioning of any needle insertion point.
- [1318] arXiv:2404.08991 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Business models for the simulation hypothesisSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); General Economics (econ.GN)
Abstract: The simulation hypothesis suggests that we live in a computer simulation. That notion has attracted significant scholarly and popular interest. This article explores the simulation hypothesis from a business perspective. Due to the lack of a name for a universe consistent with the simulation hypothesis, we propose the term simuverse. We argue that if we live in a simulation, there must be a business justification. Therefore, we ask: If we live in a simuverse, what is its business model? We identify and explore business model scenarios, such as simuverse as a project, service, or platform. We also explore business model pathways and risk management issues. The article contributes to the simulation hypothesis literature and is the first to provide a business model perspective on the simulation hypothesis. The article discusses theoretical and practical implications and identifies opportunities for future research related to sustainability, digital transformation, and Artificial Intelligence (AI).
- [1319] arXiv:2404.08995 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Beyond Known Clusters: Probe New Prototypes for Efficient Generalized Class DiscoveryComments: 9 pages, 7 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Generalized Class Discovery (GCD) aims to dynamically assign labels to unlabelled data partially based on knowledge learned from labelled data, where the unlabelled data may come from known or novel classes. The prevailing approach generally involves clustering across all data and learning conceptions by prototypical contrastive learning. However, existing methods largely hinge on the performance of clustering algorithms and are thus subject to their inherent limitations. Firstly, the estimated cluster number is often smaller than the ground truth, making the existing methods suffer from the lack of prototypes for comprehensive conception learning. To address this issue, we propose an adaptive probing mechanism that introduces learnable potential prototypes to expand cluster prototypes (centers). As there is no ground truth for the potential prototype, we develop a self-supervised prototype learning framework to optimize the potential prototype in an end-to-end fashion. Secondly, clustering is computationally intensive, and the conventional strategy of clustering both labelled and unlabelled instances exacerbates this issue. To counteract this inefficiency, we opt to cluster only the unlabelled instances and subsequently expand the cluster prototypes with our introduced potential prototypes to fast explore novel classes. Despite the simplicity of our proposed method, extensive empirical analysis on a wide range of datasets confirms that our method consistently delivers state-of-the-art results. Specifically, our method surpasses the nearest competitor by a significant margin of 9.7% within the Stanford Cars dataset and 12x clustering efficiency within the Herbarium 19 dataset. We will make the code and checkpoints publicly available at this https URL .
- [1320] arXiv:2404.09001 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in HouseholdsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Despite the significant demand for assistive technology among vulnerable groups (e.g., the elderly, children, and the disabled) in daily tasks, research into advanced AI-driven assistive solutions that genuinely accommodate their diverse needs remains sparse. Traditional human-machine interaction tasks often require machines to simply help without nuanced consideration of human abilities and feelings, such as their opportunity for practice and learning, sense of self-improvement, and self-esteem. Addressing this gap, we define a pivotal and novel challenge Smart Help, which aims to provide proactive yet adaptive support to human agents with diverse disabilities and dynamic goals in various tasks and environments. To establish this challenge, we leverage AI2-THOR to build a new interactive 3D realistic household environment for the Smart Help task. We introduce an innovative opponent modeling module that provides a nuanced understanding of the main agent's capabilities and goals, in order to optimize the assisting agent's helping policy. Rigorous experiments validate the efficacy of our model components and show the superiority of our holistic approach against established baselines. Our findings illustrate the potential of AI-imbued assistive robots in improving the well-being of vulnerable groups.
- [1321] arXiv:2404.09005 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Proof-of-Learning with Incentive SecurityComments: 17 pages, 5 figuresSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Abstract: Most concurrent blockchain systems rely heavily on the Proof-of-Work (PoW) or Proof-of-Stake (PoS) mechanisms for decentralized consensus and security assurance. However, the substantial energy expenditure stemming from computationally intensive yet meaningless tasks has raised considerable concerns surrounding traditional PoW approaches, The PoS mechanism, while free of energy consumption, is subject to security and economic issues. Addressing these issues, the paradigm of Proof-of-Useful-Work (PoUW) seeks to employ challenges of practical significance as PoW, thereby imbuing energy consumption with tangible value. While previous efforts in Proof of Learning (PoL) explored the utilization of deep learning model training SGD tasks as PoUW challenges, recent research has revealed its vulnerabilities to adversarial attacks and the theoretical hardness in crafting a byzantine-secure PoL mechanism. In this paper, we introduce the concept of incentive-security that incentivizes rational provers to behave honestly for their best interest, bypassing the existing hardness to design a PoL mechanism with computational efficiency, a provable incentive-security guarantee and controllable difficulty. Particularly, our work is secure against two attacks to the recent work of Jia et al. [2021], and also improves the computational overhead from $\Theta(1)$ to $O(\frac{\log E}{E})$. Furthermore, while most recent research assumes trusted problem providers and verifiers, our design also guarantees frontend incentive-security even when problem providers are untrusted, and verifier incentive-security that bypasses the Verifier's Dilemma. By incorporating ML training into blockchain consensus mechanisms with provable guarantees, our research not only proposes an eco-friendly solution to blockchain systems, but also provides a proposal for a completely decentralized computing power market in the new AI age.
- [1322] arXiv:2404.09016 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Theoretical research on generative diffusion models: an overviewSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Generative diffusion models showed high success in many fields with a powerful theoretical background. They convert the data distribution to noise and remove the noise back to obtain a similar distribution. Many existing reviews focused on the specific application areas without concentrating on the research about the algorithm. Unlike them we investigated the theoretical developments of the generative diffusion models. These approaches mainly divide into two: training-based and sampling-based. Awakening to this allowed us a clear and understandable categorization for the researchers who will make new developments in the future.
- [1323] arXiv:2404.09022 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Navigating the Landscape of Large Language Models: A Comprehensive Review and Analysis of Paradigms and Fine-Tuning StrategiesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: With the surge of ChatGPT,the use of large models has significantly increased,rapidly rising to prominence across the industry and sweeping across the internet. This article is a comprehensive review of fine-tuning methods for large models. This paper investigates the latest technological advancements and the application of advanced methods in aspects such as task-adaptive fine-tuning,domain-adaptive fine-tuning,few-shot learning,knowledge distillation,multi-task learning,parameter-efficient fine-tuning,and dynamic fine-tuning.
- [1324] arXiv:2404.09042 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Improving Personalisation in Valence and Arousal Prediction using Data AugmentationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: In the field of emotion recognition and Human-Machine Interaction (HMI), personalised approaches have exhibited their efficacy in capturing individual-specific characteristics and enhancing affective prediction accuracy. However, personalisation techniques often face the challenge of limited data for target individuals. This paper presents our work on an enhanced personalisation strategy, that leverages data augmentation to develop tailored models for continuous valence and arousal prediction. Our proposed approach, Distance Weighting Augmentation (DWA), employs a weighting-based augmentation method that expands a target individual's dataset, leveraging distance metrics to identify similar samples at the segment-level. Experimental results on the MuSe-Personalisation 2023 Challenge dataset demonstrate that our method significantly improves the performance of features sets which have low baseline performance, on the test set. This improvement in poor-performing features comes without sacrificing performance on high-performing features. In particular, our method achieves a maximum combined testing CCC of 0.78, compared to the reported baseline score of 0.76 (reproduced at 0.72). It also achieved a peak arousal and valence scores of 0.81 and 0.76, compared to reproduced baseline scores of 0.76 and 0.67 respectively. Through this work, we make significant contributions to the advancement of personalised affective computing models, enhancing the practicality and adaptability of data-level personalisation in real world contexts.
- [1325] arXiv:2404.09045 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Adapting Mental Health Prediction Tasks for Cross-lingual Learning via Meta-Training and In-context Learning with Large Language ModelSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Timely identification is essential for the efficient handling of mental health illnesses such as depression. However, the current research fails to adequately address the prediction of mental health conditions from social media data in low-resource African languages like Swahili. This study introduces two distinct approaches utilising model-agnostic meta-learning and leveraging large language models (LLMs) to address this gap. Experiments are conducted on three datasets translated to low-resource language and applied to four mental health tasks, which include stress, depression, depression severity and suicidal ideation prediction. we first apply a meta-learning model with self-supervision, which results in improved model initialisation for rapid adaptation and cross-lingual transfer. The results show that our meta-trained model performs significantly better than standard fine-tuning methods, outperforming the baseline fine-tuning in macro F1 score with 18\% and 0.8\% over XLM-R and mBERT. In parallel, we use LLMs' in-context learning capabilities to assess their performance accuracy across the Swahili mental health prediction tasks by analysing different cross-lingual prompting approaches. Our analysis showed that Swahili prompts performed better than cross-lingual prompts but less than English prompts. Our findings show that in-context learning can be achieved through cross-lingual transfer through carefully crafted prompt templates with examples and instructions.
- [1326] arXiv:2404.09051 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Rethinking Iterative Stereo Matching from Diffusion Bridge Model PerspectiveComments: tip. arXiv admin note: text overlap with arXiv:2303.06615 by other authorsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recently, iteration-based stereo matching has shown great potential. However, these models optimize the disparity map using RNN variants. The discrete optimization process poses a challenge of information loss, which restricts the level of detail that can be expressed in the generated disparity map. In order to address these issues, we propose a novel training approach that incorporates diffusion models into the iterative optimization process. We designed a Time-based Gated Recurrent Unit (T-GRU) to correlate temporal and disparity outputs. Unlike standard recurrent units, we employ Agent Attention to generate more expressive features. We also designed an attention-based context network to capture a large amount of contextual information. Experiments on several public benchmarks show that we have achieved competitive stereo matching performance. Our model ranks first in the Scene Flow dataset, achieving over a 7% improvement compared to competing methods, and requires only 8 iterations to achieve state-of-the-art results.
- [1327] arXiv:2404.09067 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Exploring Explainability in Video Action RecognitionComments: 6 pages, 10 figures, Accepted to the 3rd Explainable AI for Computer Vision (XAI4CV) Workshop at CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Image Classification and Video Action Recognition are perhaps the two most foundational tasks in computer vision. Consequently, explaining the inner workings of trained deep neural networks is of prime importance. While numerous efforts focus on explaining the decisions of trained deep neural networks in image classification, exploration in the domain of its temporal version, video action recognition, has been scant. In this work, we take a deeper look at this problem. We begin by revisiting Grad-CAM, one of the popular feature attribution methods for Image Classification, and its extension to Video Action Recognition tasks and examine the method's limitations. To address these, we introduce Video-TCAV, by building on TCAV for Image Classification tasks, which aims to quantify the importance of specific concepts in the decision-making process of Video Action Recognition models. As the scalable generation of concepts is still an open problem, we propose a machine-assisted approach to generate spatial and spatiotemporal concepts relevant to Video Action Recognition for testing Video-TCAV. We then establish the importance of temporally-varying concepts by demonstrating the superiority of dynamic spatiotemporal concepts over trivial spatial concepts. In conclusion, we introduce a framework for investigating hypotheses in action recognition and quantitatively testing them, thus advancing research in the explainability of deep neural networks used in video action recognition.
- [1328] arXiv:2404.09077 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CuriousLLM: Elevating Multi-Document QA with Reasoning-Infused Knowledge Graph PromptingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: In the field of Question Answering (QA), unifying large language models (LLMs) with external databases has shown great success. However, these methods often fall short in providing the advanced reasoning needed for complex QA tasks. To address these issues, we improve over a novel approach called Knowledge Graph Prompting (KGP), which combines knowledge graphs with a LLM-based agent to improve reasoning and search accuracy. Nevertheless, the original KGP framework necessitates costly fine-tuning with large datasets yet still suffers from LLM hallucination. Therefore, we propose a reasoning-infused LLM agent to enhance this framework. This agent mimics human curiosity to ask follow-up questions to more efficiently navigate the search. This simple modification significantly boosts the LLM performance in QA tasks without the high costs and latency associated with the initial KGP framework. Our ultimate goal is to further develop this approach, leading to more accurate, faster, and cost-effective solutions in the QA domain.
- [1329] arXiv:2404.09091 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Semantic In-Domain Product Identification for Search QueriesSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Accurate explicit and implicit product identification in search queries is critical for enhancing user experiences, especially at a company like Adobe which has over 50 products and covers queries across hundreds of tools. In this work, we present a novel approach to training a product classifier from user behavioral data. Our semantic model led to >25% relative improvement in CTR (click through rate) across the deployed surfaces; a >50% decrease in null rate; a 2x increase in the app cards surfaced, which helps drive product visibility.
- [1330] arXiv:2404.09101 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Mixture of Experts Soften the Curse of Dimensionality in Operator LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Abstract: In this paper, we construct a mixture of neural operators (MoNOs) between function spaces whose complexity is distributed over a network of expert neural operators (NOs), with each NO satisfying parameter scaling restrictions. Our main result is a \textit{distributed} universal approximation theorem guaranteeing that any Lipschitz non-linear operator between $L^2([0,1]^d)$ spaces can be approximated uniformly over the Sobolev unit ball therein, to any given $\varepsilon>0$ accuracy, by an MoNO while satisfying the constraint that: each expert NO has a depth, width, and rank of $\mathcal{O}(\varepsilon^{-1})$. Naturally, our result implies that the required number of experts must be large, however, each NO is guaranteed to be small enough to be loadable into the active memory of most computers for reasonable accuracies $\varepsilon$. During our analysis, we also obtain new quantitative expression rates for classical NOs approximating uniformly continuous non-linear operators uniformly on compact subsets of $L^2([0,1]^d)$.
- [1331] arXiv:2404.09105 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: EGGS: Edge Guided Gaussian Splatting for Radiance FieldsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Image and Video Processing (eess.IV)
Abstract: The Gaussian splatting methods are getting popular. However, their loss function only contains the $\ell_1$ norm and the structural similarity between the rendered and input images, without considering the edges in these images. It is well-known that the edges in an image provide important information. Therefore, in this paper, we propose an Edge Guided Gaussian Splatting (EGGS) method that leverages the edges in the input images. More specifically, we give the edge region a higher weight than the flat region. With such edge guidance, the resulting Gaussian particles focus more on the edges instead of the flat regions. Moreover, such edge guidance does not crease the computation cost during the training and rendering stage. The experiments confirm that such simple edge-weighted loss function indeed improves about $1\sim2$ dB on several difference data sets. With simply plugging in the edge guidance, the proposed method can improve all Gaussian splatting methods in different scenarios, such as human head modeling, building 3D reconstruction, etc.
- [1332] arXiv:2404.09110 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: ProSAS: An O-RAN Approach to Spectrum Sharing between NR and LTEComments: Accepted for publication at IEEE International Conference on Communications (ICC) 2024Subjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI)
Abstract: The Open Radio Access Network (O-RAN), an industry-driven initiative, utilizes intelligent Radio Access Network (RAN) controllers and open interfaces to facilitate efficient spectrum sharing between LTE and NR RANs. In this paper, we introduce the Proactive Spectrum Adaptation Scheme (ProSAS), a data-driven, O-RAN-compatible spectrum sharing solution. ProSAS is an intelligent radio resource demand prediction and management scheme for intent-driven spectrum management that minimizes surplus or deficit experienced by both RANs. We illustrate the effectiveness of this solution using real-world LTE resource usage data and synthetically generated NR data. Lastly, we discuss a high-level O-RAN-compatible architecture of the proposed solution.
- [1333] arXiv:2404.09114 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Intelligent Chemical Purification Technique Based on Machine LearningComments: 22 pages, 5 Figures, Submitted to Nature Machine IntelligenceSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Abstract: We present an innovative of artificial intelligence with column chromatography, aiming to resolve inefficiencies and standardize data collection in chemical separation and purification domain. By developing an automated platform for precise data acquisition and employing advanced machine learning algorithms, we constructed predictive models to forecast key separation parameters, thereby enhancing the efficiency and quality of chromatographic processes. The application of transfer learning allows the model to adapt across various column specifications, broadening its utility. A novel metric, separation probability ($S_p$), quantifies the likelihood of effective compound separation, validated through experimental verification. This study signifies a significant step forward int the application of AI in chemical research, offering a scalable solution to traditional chromatography challenges and providing a foundation for future technological advancements in chemical analysis and purification.
- [1334] arXiv:2404.09123 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Provable Interactive Learning with Hindsight Instruction FeedbackSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Abstract: We study interactive learning in a setting where the agent has to generate a response (e.g., an action or trajectory) given a context and an instruction. In contrast, to typical approaches that train the system using reward or expert supervision on response, we study learning with hindsight instruction where a teacher provides an instruction that is most suitable for the agent's generated response. This hindsight labeling of instruction is often easier to provide than providing expert supervision of the optimal response which may require expert knowledge or can be impractical to elicit. We initiate the theoretical analysis of interactive learning with hindsight labeling. We first provide a lower bound showing that in general, the regret of any algorithm must scale with the size of the agent's response space. We then study a specialized setting where the underlying instruction-response distribution can be decomposed as a low-rank matrix. We introduce an algorithm called LORIL for this setting and show that its regret scales as $\sqrt{T}$ where $T$ is the number of rounds and depends on the intrinsic rank but does not depend on the size of the agent's response space. We provide experiments in two domains showing that LORIL outperforms baselines even when the low-rank assumption is violated.
- [1335] arXiv:2404.09136 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: TLDR at SemEval-2024 Task 2: T5-generated clinical-Language summaries for DeBERTa Report AnalysisJournal-ref: In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pages 507-516, Mexico City, Mexico. Association for Computational LinguisticsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper introduces novel methodologies for the Natural Language Inference for Clinical Trials (NLI4CT) task. We present TLDR (T5-generated clinical-Language summaries for DeBERTa Report Analysis) which incorporates T5-model generated premise summaries for improved entailment and contradiction analysis in clinical NLI tasks. This approach overcomes the challenges posed by small context windows and lengthy premises, leading to a substantial improvement in Macro F1 scores: a 0.184 increase over truncated premises. Our comprehensive experimental evaluation, including detailed error analysis and ablations, confirms the superiority of TLDR in achieving consistency and faithfulness in predictions against semantically altered inputs.
- [1336] arXiv:2404.09138 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language RepresentationArtur Kiulian , Anton Polishko , Mykola Khandoga , Oryna Chubych , Jack Connor , Raghav Ravishankar , Adarsh ShirawalmathSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In the rapidly advancing field of AI and NLP, generative large language models (LLMs) stand at the forefront of innovation, showcasing unparalleled abilities in text understanding and generation. However, the limited representation of low-resource languages like Ukrainian poses a notable challenge, restricting the reach and relevance of this technology. Our paper addresses this by fine-tuning the open-source Gemma and Mistral LLMs with Ukrainian datasets, aiming to improve their linguistic proficiency and benchmarking them against other existing models capable of processing Ukrainian language. This endeavor not only aims to mitigate language bias in technology but also promotes inclusivity in the digital realm. Our transparent and reproducible approach encourages further NLP research and development. Additionally, we present the Ukrainian Knowledge and Instruction Dataset (UKID) to aid future efforts in language model fine-tuning. Our research not only advances the field of NLP but also highlights the importance of linguistic diversity in AI, which is crucial for cultural preservation, education, and expanding AI's global utility. Ultimately, we advocate for a future where technology is inclusive, enabling AI to communicate effectively across all languages, especially those currently underrepresented.
- [1337] arXiv:2404.09145 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ToNER: Type-oriented Named Entity Recognition with Generative Language ModelComments: Accepted at LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In recent years, the fine-tuned generative models have been proven more powerful than the previous tagging-based or span-based models on named entity recognition (NER) task. It has also been found that the information related to entities, such as entity types, can prompt a model to achieve NER better. However, it is not easy to determine the entity types indeed existing in the given sentence in advance, and inputting too many potential entity types would distract the model inevitably. To exploit entity types' merit on promoting NER task, in this paper we propose a novel NER framework, namely ToNER based on a generative model. In ToNER, a type matching model is proposed at first to identify the entity types most likely to appear in the sentence. Then, we append a multiple binary classification task to fine-tune the generative model's encoder, so as to generate the refined representation of the input sentence. Moreover, we add an auxiliary task for the model to discover the entity types which further fine-tunes the model to output more accurate results. Our extensive experiments on some NER benchmarks verify the effectiveness of our proposed strategies in ToNER that are oriented towards entity types' exploitation.
- [1338] arXiv:2404.09146 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Fusion-Mamba for Cross-modality Object DetectionWenhao Dong , Haodong Zhu , Shaohui Lin , Xiaoyan Luo , Yunhang Shen , Xuhui Liu , Juan Zhang , Guodong Guo , Baochang ZhangSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Cross-modality fusing complementary information from different modalities effectively improves object detection performance, making it more useful and robust for a wider range of applications. Existing fusion strategies combine different types of images or merge different backbone features through elaborated neural network modules. However, these methods neglect that modality disparities affect cross-modality fusion performance, as different modalities with different camera focal lengths, placements, and angles are hardly fused. In this paper, we investigate cross-modality fusion by associating cross-modal features in a hidden state space based on an improved Mamba with a gating mechanism. We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction, thereby reducing disparities between cross-modal features and enhancing the representation consistency of fused features. FMB contains two modules: the State Space Channel Swapping (SSCS) module facilitates shallow feature fusion, and the Dual State Space Fusion (DSSF) enables deep fusion in a hidden state space. Through extensive experiments on public datasets, our proposed approach outperforms the state-of-the-art methods on $m$AP with 5.9% on $M^3FD$ and 4.9% on FLIR-Aligned datasets, demonstrating superior object detection performance. To the best of our knowledge, this is the first work to explore the potential of Mamba for cross-modal fusion and establish a new baseline for cross-modality object detection.
- [1339] arXiv:2404.09155 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Mitigating Heterogeneity among Factor Tensors via Lie Group Manifolds for Tensor Decomposition Based Temporal Knowledge Graph EmbeddingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Recent studies have highlighted the effectiveness of tensor decomposition methods in the Temporal Knowledge Graphs Embedding (TKGE) task. However, we found that inherent heterogeneity among factor tensors in tensor decomposition significantly hinders the tensor fusion process and further limits the performance of link prediction. To overcome this limitation, we introduce a novel method that maps factor tensors onto a unified smooth Lie group manifold to make the distribution of factor tensors approximating homogeneous in tensor decomposition. We provide the theoretical proof of our motivation that homogeneous tensors are more effective than heterogeneous tensors in tensor fusion and approximating the target for tensor decomposition based TKGE methods. The proposed method can be directly integrated into existing tensor decomposition based TKGE methods without introducing extra parameters. Extensive experiments demonstrate the effectiveness of our method in mitigating the heterogeneity and in enhancing the tensor decomposition based TKGE models.
- [1340] arXiv:2404.09158 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: StreakNet-Arch: An Anti-scattering Network-based Architecture for Underwater Carrier LiDAR-Radar ImagingComments: Reduce the number of pages to 13Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we introduce StreakNet-Arch, a novel signal processing architecture designed for Underwater Carrier LiDAR-Radar (UCLR) imaging systems, to address the limitations in scatter suppression and real-time imaging. StreakNet-Arch formulates the signal processing as a real-time, end-to-end binary classification task, enabling real-time image acquisition. To achieve this, we leverage Self-Attention networks and propose a novel Double Branch Cross Attention (DBC-Attention) mechanism that surpasses the performance of traditional methods. Furthermore, we present a method for embedding streak-tube camera images into attention networks, effectively acting as a learned bandpass filter. To facilitate further research, we contribute a publicly available streak-tube camera image dataset. The dataset contains 2,695,168 real-world underwater 3D point cloud data. These advancements significantly improve UCLR capabilities, enhancing its performance and applicability in underwater imaging tasks. The source code and dataset can be found at this https URL .
- [1341] arXiv:2404.09163 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: GeMQuAD : Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot LearningComments: Accepted to The 37th International Conference on Neural Information Processing Systems (NeurIPS 2023)December 10-16, 2023 - SyntheticData4ML workshop, New Orleans, United States this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The emergence of Large Language Models (LLMs) with capabilities like In-Context Learning (ICL) has ushered in new possibilities for data generation across various domains while minimizing the need for extensive data collection and modeling techniques. Researchers have explored ways to use this generated synthetic data to optimize smaller student models for reduced deployment costs and lower latency in downstream tasks. However, ICL-generated data often suffers from low quality as the task specificity is limited with few examples used in ICL. In this paper, we propose GeMQuAD - a semi-supervised learning approach, extending the WeakDAP framework, applied to a dataset generated through ICL with just one example in the target language using AlexaTM 20B Seq2Seq LLM. Through our approach, we iteratively identify high-quality data to enhance model performance, especially for low-resource multilingual setting in the context of Extractive Question Answering task. Our framework outperforms the machine translation-augmented model by 0.22/1.68 F1/EM (Exact Match) points for Hindi and 0.82/1.37 F1/EM points for Spanish on the MLQA dataset, and it surpasses the performance of model trained on an English-only dataset by 5.05/6.50 F1/EM points for Hindi and 3.81/3.69 points F1/EM for Spanish on the same dataset. Notably, our approach uses a pre-trained LLM for generation with no fine-tuning (FT), utilizing just a single annotated example in ICL to generate data, providing a cost-effective development process.
- [1342] arXiv:2404.09167 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: Survey on Embedding Models for Knowledge Graph and its ApplicationsSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI)
Abstract: Knowledge Graph (KG) is a graph based data structure to represent facts of the world where nodes represent real world entities or abstract concept and edges represent relation between the entities. Graph as representation for knowledge has several drawbacks like data sparsity, computational complexity and manual feature engineering. Knowledge Graph embedding tackles the drawback by representing entities and relation in low dimensional vector space by capturing the semantic relation between them. There are different KG embedding models. Here, we discuss translation based and neural network based embedding models which differ based on semantic property, scoring function and architecture they use. Further, we discuss application of KG in some domains that use deep learning models and leverage social media data.
- [1343] arXiv:2404.09172 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LoopAnimate: Loopable Salient Object AnimationFanyi Wang , Peng Liu , Haotian Hu , Dan Meng , Jingwen Su , Jinjin Xu , Yanhao Zhang , Xiaoming Ren , Zhiwang ZhangSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Research on diffusion model-based video generation has advanced rapidly. However, limitations in object fidelity and generation length hinder its practical applications. Additionally, specific domains like animated wallpapers require seamless looping, where the first and last frames of the video match seamlessly. To address these challenges, this paper proposes LoopAnimate, a novel method for generating videos with consistent start and end frames. To enhance object fidelity, we introduce a framework that decouples multi-level image appearance and textual semantic information. Building upon an image-to-image diffusion model, our approach incorporates both pixel-level and feature-level information from the input image, injecting image appearance and textual semantic embeddings at different positions of the diffusion model. Existing UNet-based video generation models require to input the entire videos during training to encode temporal and positional information at once. However, due to limitations in GPU memory, the number of frames is typically restricted to 16. To address this, this paper proposes a three-stage training strategy with progressively increasing frame numbers and reducing fine-tuning modules. Additionally, we introduce the Temporal E nhanced Motion Module(TEMM) to extend the capacity for encoding temporal and positional information up to 36 frames. The proposed LoopAnimate, which for the first time extends the single-pass generation length of UNet-based video generation models to 35 frames while maintaining high-quality video generation. Experiments demonstrate that LoopAnimate achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluation results.
- [1344] arXiv:2404.09173 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: TransformerFAM: Feedback attention is working memoryComments: 26 pages, 12 figures, 14 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower Large Language Models (LLMs) to process sequences of unlimited length.
- [1345] arXiv:2404.09192 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend ModelingSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Over the past decade, a series of unflagging efforts have been dedicated to developing highly expressive and controllable text-to-speech (TTS) systems. In general, the holistic TTS comprises two interconnected components: the frontend module and the backend module. The frontend excels in capturing linguistic representations from the raw text input, while the backend module converts linguistic cues to speech. The research community has shown growing interest in the study of the frontend component, recognizing its pivotal role in text-to-speech systems, including Text Normalization (TN), Prosody Boundary Prediction (PBP), and Polyphone Disambiguation (PD). Nonetheless, the limitations posed by insufficient annotated textual data and the reliance on homogeneous text signals significantly undermine the effectiveness of its supervised learning. To evade this obstacle, a novel two-stage TTS frontend prediction pipeline, named TAP-FM, is proposed in this paper. Specifically, during the first learning phase, we present a Multi-scale Contrastive Text-audio Pre-training protocol (MC-TAP), which hammers at acquiring richer insights via multi-granularity contrastive pre-training in an unsupervised manner. Instead of mining homogeneous features in prior pre-training approaches, our framework demonstrates the ability to delve deep into both global and local text-audio semantic and acoustic representations. Furthermore, a parallelized TTS frontend model is delicately devised to execute TN, PD, and PBP prediction tasks, respectively in the second stage. Finally, extensive experiments illustrate the superiority of our proposed method, achieving state-of-the-art performance.
- [1346] arXiv:2404.09204 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive results on various multimodal tasks. However, most existing MLLMs are not well suited for document-oriented tasks, which require fine-grained image perception and information compression. In this paper, we present TextHawk, a MLLM that is specifically designed for document-oriented tasks, while preserving the general capabilities of MLLMs. TextHawk is aimed to explore efficient fine-grained perception by designing four dedicated components. Firstly, a ReSampling and ReArrangement (ReSA) module is proposed to reduce the redundancy in the document texts and lower the computational cost of the MLLM. We explore encoding the positions of each local feature by presenting Scalable Positional Embeddings (SPEs), which can preserve the scalability of various image sizes. A Query Proposal Network (QPN) is then adopted to initialize the queries dynamically among different sub-images. To further enhance the fine-grained visual perceptual ability of the MLLM, we design a Multi-Level Cross-Attention (MLCA) mechanism that captures the hierarchical structure and semantic relations of document images. Furthermore, we create a new instruction-tuning dataset for document-oriented tasks by enriching the multimodal document data with Gemini Pro. We conduct extensive experiments on both general and document-oriented MLLM benchmarks, and show that TextHawk outperforms the state-of-the-art methods, demonstrating its effectiveness and superiority in fine-grained document perception and general abilities.
- [1347] arXiv:2404.09210 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FedDistill: Global Model Distillation for Local Model De-Biasing in Non-IID Federated LearningComments: 13 pages, 9 figures, 5 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Federated Learning (FL) is a novel approach that allows for collaborative machine learning while preserving data privacy by leveraging models trained on decentralized devices. However, FL faces challenges due to non-uniformly distributed (non-iid) data across clients, which impacts model performance and its generalization capabilities. To tackle the non-iid issue, recent efforts have utilized the global model as a teaching mechanism for local models. However, our pilot study shows that their effectiveness is constrained by imbalanced data distribution, which induces biases in local models and leads to a 'local forgetting' phenomenon, where the ability of models to generalize degrades over time, particularly for underrepresented classes. This paper introduces FedDistill, a framework enhancing the knowledge transfer from the global model to local models, focusing on the issue of imbalanced class distribution. Specifically, FedDistill employs group distillation, segmenting classes based on their frequency in local datasets to facilitate a focused distillation process to classes with fewer samples. Additionally, FedDistill dissects the global model into a feature extractor and a classifier. This separation empowers local models with more generalized data representation capabilities and ensures more accurate classification across all classes. FedDistill mitigates the adverse effects of data imbalance, ensuring that local models do not forget underrepresented classes but instead become more adept at recognizing and classifying them accurately. Our comprehensive experiments demonstrate FedDistill's effectiveness, surpassing existing baselines in accuracy and convergence speed across several benchmark datasets.
- [1348] arXiv:2404.09221 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Fast Inference: Exploring and Improving Blockwise Parallel DraftsTaehyeon Kim , Ananda Theertha Suresh , Kishore Papineni , Michael Riley , Sanjiv Kumar , Adrian BentonSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference speeds inherent in sequential token generation. Blockwise parallel decoding (BPD) was proposed by Stern et al. (2018) as a way to improve inference speed of language models. In this paper, we make two contributions to understanding and improving BPD drafts. We first offer an analysis of the token distributions produced by the BPD prediction heads. Secondly, we use this analysis to inform algorithms to improve BPD inference speed by refining the BPD drafts using small n-gram or neural language models. We empirically show that these refined BPD drafts yield a higher average verified prefix length across tasks.
- [1349] arXiv:2404.09248 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Knowledgeable Agents by Offline Reinforcement Learning from Large Language Model RolloutsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Reinforcement learning (RL) trains agents to accomplish complex tasks through environmental interaction data, but its capacity is also limited by the scope of the available data. To obtain a knowledgeable agent, a promising approach is to leverage the knowledge from large language models (LLMs). Despite previous studies combining LLMs with RL, seamless integration of the two components remains challenging due to their semantic gap. This paper introduces a novel method, Knowledgeable Agents from Language Model Rollouts (KALM), which extracts knowledge from LLMs in the form of imaginary rollouts that can be easily learned by the agent through offline reinforcement learning methods. The primary challenge of KALM lies in LLM grounding, as LLMs are inherently limited to textual data, whereas environmental data often comprise numerical vectors unseen to LLMs. To address this, KALM fine-tunes the LLM to perform various tasks based on environmental data, including bidirectional translation between natural language descriptions of skills and their corresponding rollout data. This grounding process enhances the LLM's comprehension of environmental dynamics, enabling it to generate diverse and meaningful imaginary rollouts that reflect novel skills. Initial empirical evaluations on the CLEVR-Robot environment demonstrate that KALM enables agents to complete complex rephrasings of task goals and extend their capabilities to novel tasks requiring unprecedented optimal behaviors. KALM achieves a success rate of 46% in executing tasks with unseen goals, substantially surpassing the 26% success rate achieved by baseline methods. Furthermore, KALM effectively enables the LLM to comprehend environmental dynamics, resulting in the generation of meaningful imaginary rollouts that reflect novel skills and demonstrate the seamless integration of large language models and reinforcement learning.
- [1350] arXiv:2404.09259 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: FedCCL: Federated Dual-Clustered Feature Contrast Under Domain HeterogeneitySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Federated learning (FL) facilitates a privacy-preserving neural network training paradigm through collaboration between edge clients and a central server. One significant challenge is that the distributed data is not independently and identically distributed (non-IID), typically including both intra-domain and inter-domain heterogeneity. However, recent research is limited to simply using averaged signals as a form of regularization and only focusing on one aspect of these non-IID challenges. Given these limitations, this paper clarifies these two non-IID challenges and attempts to introduce cluster representation to address them from both local and global perspectives. Specifically, we propose a dual-clustered feature contrast-based FL framework with dual focuses. First, we employ clustering on the local representations of each client, aiming to capture intra-class information based on these local clusters at a high level of granularity. Then, we facilitate cross-client knowledge sharing by pulling the local representation closer to clusters shared by clients with similar semantics while pushing them away from clusters with dissimilar semantics. Second, since the sizes of local clusters belonging to the same class may differ for each client, we further utilize clustering on the global side and conduct averaging to create a consistent global signal for guiding each local training in a contrastive manner. Experimental results on multiple datasets demonstrate that our proposal achieves comparable or superior performance gain under intra-domain and inter-domain heterogeneity.
- [1351] arXiv:2404.09263 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Video moment retrieval and highlight detection are two highly valuable tasks in video understanding, but until recently they have been jointly studied. Although existing studies have made impressive advancement recently, they predominantly follow the data-driven bottom-up paradigm. Such paradigm overlooks task-specific and inter-task effects, resulting in poor model performance. In this paper, we propose a novel task-driven top-down framework TaskWeave for joint moment retrieval and highlight detection. The framework introduces a task-decoupled unit to capture task-specific and common representations. To investigate the interplay between the two tasks, we propose an inter-task feedback mechanism, which transforms the results of one task as guiding masks to assist the other task. Different from existing methods, we present a task-dependent joint loss function to optimize the model. Comprehensive experiments and in-depth ablation studies on QVHighlights, TVSum, and Charades-STA datasets corroborate the effectiveness and flexibility of the proposed framework. Codes are available at this https URL .
- [1352] arXiv:2404.09265 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Make Split, not Hijack: Preventing Feature-Space Hijacking Attacks in Split LearningComments: Accepted In Proceedings of the 29th ACM Symposium on Access Control Models and Technologies (SACMAT '24)Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: The popularity of Machine Learning (ML) makes the privacy of sensitive data more imperative than ever. Collaborative learning techniques like Split Learning (SL) aim to protect client data while enhancing ML processes. Though promising, SL has been proved to be vulnerable to a plethora of attacks, thus raising concerns about its effectiveness on data privacy. In this work, we introduce a hybrid approach combining SL and Function Secret Sharing (FSS) to ensure client data privacy. The client adds a random mask to the activation map before sending it to the servers. The servers cannot access the original function but instead work with shares generated using FSS. Consequently, during both forward and backward propagation, the servers cannot reconstruct the client's raw data from the activation map. Furthermore, through visual invertibility, we demonstrate that the server is incapable of reconstructing the raw image data from the activation map when using FSS. It enhances privacy by reducing privacy leakage compared to other SL-based approaches where the server can access client input information. Our approach also ensures security against feature space hijacking attack, protecting sensitive information from potential manipulation. Our protocols yield promising results, reducing communication overhead by over 2x and training time by over 7x compared to the same model with FSS, without any SL. Also, we show that our approach achieves >96% accuracy and remains equivalent to the plaintext models.
- [1353] arXiv:2404.09275 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: TrafficVLM: A Controllable Visual Language Model for Traffic Video CaptioningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Traffic video description and analysis have received much attention recently due to the growing demand for efficient and reliable urban surveillance systems. Most existing methods only focus on locating traffic event segments, which severely lack descriptive details related to the behaviour and context of all the subjects of interest in the events. In this paper, we present TrafficVLM, a novel multi-modal dense video captioning model for vehicle ego camera view. TrafficVLM models traffic video events at different levels of analysis, both spatially and temporally, and generates long fine-grained descriptions for the vehicle and pedestrian at different phases of the event. We also propose a conditional component for TrafficVLM to control the generation outputs and a multi-task fine-tuning paradigm to enhance TrafficVLM's learning capability. Experiments show that TrafficVLM performs well on both vehicle and overhead camera views. Our solution achieved outstanding results in Track 2 of the AI City Challenge 2024, ranking us third in the challenge standings. Our code is publicly available at this https URL .
- [1354] arXiv:2404.09286 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Artificial Intelligence enhanced Security Problems in Real-Time Scenario using Blowfish AlgorithmSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: In a nutshell, "the cloud" refers to a collection of interconnected computing resources made possible by an extensive, real-time communication network like the internet. Because of its potential to reduce processing costs, the emerging paradigm of cloud computing has recently attracted a large number of academics. The exponential expansion of cloud computing has made the rapid expansion of cloud services very remarkable. Ensuring the security of personal information in today's interconnected world is no easy task. These days, security is really crucial. Models of security that are relevant to cloud computing include confidentiality, authenticity, accessibility, data integrity, and recovery. Using the Hybrid Encryption this study, we cover all the security issues and leaks in cloud infrastructure.
- [1355] arXiv:2404.09292 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Bridging Data Islands: Geographic Heterogeneity-Aware Federated Learning for Collaborative Remote Sensing Semantic SegmentationComments: 13 pages,9 figures, 4 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Remote sensing semantic segmentation (RSS) is an essential task in Earth Observation missions. Due to data privacy concerns, high-quality remote sensing images with annotations cannot be well shared among institutions, making it difficult to fully utilize RSS data to train a generalized model. Federated Learning (FL), a privacy-preserving collaborative learning technology, is a potential solution. However, the current research on how to effectively apply FL in RSS is still scarce and requires further investigation. Remote sensing images in various institutions often exhibit strong geographical heterogeneity. More specifically, it is reflected in terms of class-distribution heterogeneity and object-appearance heterogeneity. Unfortunately, most existing FL studies show inadequate focus on geographical heterogeneity, thus leading to performance degradation in the global model. Considering the aforementioned issues, we propose a novel Geographic Heterogeneity-Aware Federated Learning (GeoFed) framework to address privacy-preserving RSS. Through Global Feature Extension and Tail Regeneration modules, class-distribution heterogeneity is alleviated. Additionally, we design an Essential Feature Mining strategy to alleviate object-appearance heterogeneity by constructing essential features. Extensive experiments on three datasets (i.e., FBP, CASID, Inria) show that our GeoFed consistently outperforms the current state-of-the-art methods. The code will be available publicly.
- [1356] arXiv:2404.09302 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: High Significant Fault Detection in Azure Core Workload InsightsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Azure Core workload insights have time-series data with different metric units. Faults or Anomalies are observed in these time-series data owing to faults observed with respect to metric name, resources region, dimensions, and its dimension value associated with the data. For Azure Core, an important task is to highlight faults or anomalies to the user on a dashboard that they can perceive easily. The number of anomalies reported should be highly significant and in a limited number, e.g., 5-20 anomalies reported per hour. The reported anomalies will have significant user perception and high reconstruction error in any time-series forecasting model. Hence, our task is to automatically identify 'high significant anomalies' and their associated information for user perception.
- [1357] arXiv:2404.09313 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and AccompanimentZhiqing Hong , Rongjie Huang , Xize Cheng , Yongqi Wang , Ruiqi Li , Fuming You , Zhou Zhao , Zhimeng ZhangSubjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI)
Abstract: A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently. Little attention was paid to explore song synthesis. In this work, we propose a novel task called text-to-song synthesis which incorporating both vocals and accompaniments generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representation for controllable V2A synthesis. A Chinese song dataset mined from a music website is built up to alleviate data scarcity for our research. The evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found in this https URL .
- [1358] arXiv:2404.09317 (cross-list from cs.AR) [ pdf , ps , html , other ]
-
Title: Characterizing Soft-Error Resiliency in Arm's Ethos-U55 Embedded Machine Learning AcceleratorSubjects: Hardware Architecture (cs.AR) ; Artificial Intelligence (cs.AI)
Abstract: As Neural Processing Units (NPU) or accelerators are increasingly deployed in a variety of applications including safety critical applications such as autonomous vehicle, and medical imaging, it is critical to understand the fault-tolerance nature of the NPUs. We present a reliability study of Arm's Ethos-U55, an important industrial-scale NPU being utilised in embedded and IoT applications. We perform large scale RTL-level fault injections to characterize Ethos-U55 against the Automotive Safety Integrity Level D (ASIL-D) resiliency standard commonly used for safety-critical applications such as autonomous vehicles. We show that, under soft errors, all four configurations of the NPU fall short of the required level of resiliency for a variety of neural networks running on the NPU. We show that it is possible to meet the ASIL-D level resiliency without resorting to conventional strategies like Dual Core Lock Step (DCLS) that has an area overhead of 100%. We achieve so through selective protection, where hardware structures are selectively protected (e.g., duplicated, hardened) based on their sensitivity to soft errors and their silicon areas. To identify the optimal configuration that minimizes the area overhead while meeting the ASIL-D standard, the main challenge is the large search space associated with the time-consuming RTL simulation. To address this challenge, we present a statistical analysis tool that is validated against Arm silicon and that allows us to quickly navigate hundreds of billions of fault sites without exhaustive RTL fault injections. We show that by carefully duplicating a small fraction of the functional blocks and hardening the Flops in other blocks meets the ASIL-D safety standard while introducing an area overhead of only 38%.
- [1359] arXiv:2404.09322 (cross-list from cs.DC) [ pdf , ps , other ]
-
Title: The intelligent prediction and assessment of financial information risk in the cloud computing modelSubjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI)
Abstract: Cloud computing (cloud computing) is a kind of distributed computing, referring to the network "cloud" will be a huge data calculation and processing program into countless small programs, and then, through the system composed of multiple servers to process and analyze these small programs to get the results and return to the user. This report explores the intersection of cloud computing and financial information processing, identifying risks and challenges faced by financial institutions in adopting cloud technology. It discusses the need for intelligent solutions to enhance data processing efficiency and accuracy while addressing security and privacy concerns. Drawing on regulatory frameworks, the report proposes policy recommendations to mitigate concentration risks associated with cloud computing in the financial industry. By combining intelligent forecasting and evaluation technologies with cloud computing models, the study aims to provide effective solutions for financial data processing and management, facilitating the industry's transition towards digital transformation.
- [1360] arXiv:2404.09326 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision TransformersDiana-Nicoleta Grigore , Mariana-Iuliana Georgescu , Jon Alvarez Justo , Tor Johansen , Andreea Iuliana Ionescu , Radu Tudor IonescuSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Few-shot knowledge distillation recently emerged as a viable approach to harness the knowledge of large-scale pre-trained models, using limited data and computational resources. In this paper, we propose a novel few-shot feature distillation approach for vision transformers. Our approach is based on two key steps. Leveraging the fact that vision transformers have a consistent depth-wise structure, we first copy the weights from intermittent layers of existing pre-trained vision transformers (teachers) into shallower architectures (students), where the intermittence factor controls the complexity of the student transformer with respect to its teacher. Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario, aiming to recover the information processing carried out by the skipped teacher layers. We present comprehensive experiments with supervised and self-supervised transformers as teachers, on five data sets from various domains, including natural, medical and satellite images. The empirical results confirm the superiority of our approach over competitive baselines. Moreover, the ablation results demonstrate the usefulness of each component of the proposed pipeline.
- [1361] arXiv:2404.09331 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: SNN4Agents: A Framework for Developing Energy-Efficient Embodied Spiking Neural Networks for Autonomous AgentsComments: 18 pages, 15 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Abstract: Recent trends have shown that autonomous agents, such as Autonomous Ground Vehicles (AGVs), Unmanned Aerial Vehicles (UAVs), and mobile robots, effectively improve human productivity in solving diverse tasks. However, since these agents are typically powered by portable batteries, they require extremely low power/energy consumption to operate in a long lifespan. To solve this challenge, neuromorphic computing has emerged as a promising solution, where bio-inspired Spiking Neural Networks (SNNs) use spikes from event-based cameras or data conversion pre-processing to perform sparse computations efficiently. However, the studies of SNN deployments for autonomous agents are still at an early stage. Hence, the optimization stages for enabling efficient embodied SNN deployments for autonomous agents have not been defined systematically. Toward this, we propose a novel framework called SNN4Agents that consists of a set of optimization techniques for designing energy-efficient embodied SNNs targeting autonomous agent applications. Our SNN4Agents employs weight quantization, timestep reduction, and attention window reduction to jointly improve the energy efficiency, reduce the memory footprint, optimize the processing latency, while maintaining high accuracy. In the evaluation, we investigate use cases of event-based car recognition, and explore the trade-offs among accuracy, latency, memory, and energy consumption. The experimental results show that our proposed framework can maintain high accuracy (i.e., 84.12% accuracy) with 68.75% memory saving, 3.58x speed-up, and 4.03x energy efficiency improvement as compared to the state-of-the-art work for NCARS dataset, thereby enabling energy-efficient embodied SNN deployments for autonomous agents.
- [1362] arXiv:2404.09336 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Self-Selected Attention Span for Accelerating Large Language Model InferenceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) can solve challenging tasks. However, their inference computation on modern GPUs is highly inefficient due to the increasing number of tokens they must attend to as they generate new ones. To address this inefficiency, we capitalize on LLMs' problem-solving capabilities to optimize their own inference-time efficiency. We demonstrate with two specific tasks: (a) evaluating complex arithmetic expressions and (b) summarizing news articles. For both tasks, we create custom datasets to fine-tune an LLM. The goal of fine-tuning is twofold: first, to make the LLM learn to solve the evaluation or summarization task, and second, to train it to identify the minimal attention spans required for each step of the task. As a result, the fine-tuned model is able to convert these self-identified minimal attention spans into sparse attention masks on-the-fly during inference. We develop a custom CUDA kernel to take advantage of the reduced context to attend to. We demonstrate that using this custom CUDA kernel improves the throughput of LLM inference by 28%. Our work presents an end-to-end demonstration showing that training LLMs to self-select their attention spans speeds up autoregressive inference in solving real-world tasks.
- [1363] arXiv:2404.09339 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Practical Tool Usage for Continually Learning LLMsComments: 20 pages, 11 tables, 7 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) show an innate skill for solving language based tasks. But insights have suggested an inability to adjust for information or task-solving skills becoming outdated, as their knowledge, stored directly within their parameters, remains static in time. Tool use helps by offloading work to systems that the LLM can access through an interface, but LLMs that use them still must adapt to nonstationary environments for prolonged use, as new tools can emerge and existing tools can change. Nevertheless, tools require less specialized knowledge, therefore we hypothesize they are better suited for continual learning (CL) as they rely less on parametric memory for solving tasks and instead focus on learning when to apply pre-defined tools. To verify this, we develop a synthetic benchmark and follow this by aggregating existing NLP tasks to form a more realistic testing scenario. While we demonstrate scaling model size is not a solution, regardless of tool usage, continual learning techniques can enable tool LLMs to both adapt faster while forgetting less, highlighting their potential as continual learners.
- [1364] arXiv:2404.09352 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Counteracting Concept Drift by Learning with Future Malware PredictionsSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: The accuracy of deployed malware-detection classifiers degrades over time due to changes in data distributions and increasing discrepancies between training and testing data. This phenomenon is known as the concept drift. While the concept drift can be caused by various reasons in general, new malicious files are created by malware authors with a clear intention of avoiding detection. The existence of the intention opens a possibility for predicting such future samples. Including predicted samples in training data should consequently increase the accuracy of the classifiers on new testing data.
We compare two methods for predicting future samples: (1) adversarial training and (2) generative adversarial networks (GANs). The first method explicitly seeks for adversarial examples against the classifier that are then used as a part of training data. Similarly, GANs also generate synthetic training data. We use GANs to learn changes in data distributions within different time periods of training data and then apply these changes to generate samples that could be in testing data. We compare these prediction methods on two different datasets: (1) Ember public dataset and (2) the internal dataset of files incoming to Avast. We show that while adversarial training yields more robust classifiers, this method is not a good predictor of future malware in general. This is in contrast with previously reported positive results in different domains (including natural language processing and spam detection). On the other hand, we show that GANs can be successfully used as predictors of future malware. We specifically examine malware families that exhibit significant changes in their data distributions over time and the experimental results confirm that GAN-based predictions can significantly improve the accuracy of the classifier on new, previously unseen data. - [1365] arXiv:2404.09356 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: LLeMpower: Understanding Disparities in the Control and Access of Large Language ModelsComments: 11 total pages, 7 page text, 4 page references, 3 figures (with subfigures), 1 tableSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Abstract: Large Language Models (LLMs) are a powerful technology that augment human skill to create new opportunities, akin to the development of steam engines and the internet. However, LLMs come with a high cost. They require significant computing resources and energy to train and serve. Inequity in their control and access has led to concentration of ownership and power to a small collection of corporations. In our study, we collect training and inference requirements for various LLMs. We then analyze the economic strengths of nations and organizations in the context of developing and serving these models. Additionally, we also look at whether individuals around the world can access and use this emerging technology. We compare and contrast these groups to show that these technologies are monopolized by a surprisingly few entities. We conclude with a qualitative study on the ethical implications of our findings and discuss future directions towards equity in LLM access.
- [1366] arXiv:2404.09359 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Exploring Feedback Generation in Automated Skeletal Movement Assessment: A Comprehensive OverviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The application of machine-learning solutions to movement assessment from skeleton videos has attracted significant research attention in recent years. This advancement has made rehabilitation at home more accessible, utilizing movement assessment algorithms that can operate on affordable equipment for human pose detection and analysis from 2D or 3D videos. While the primary objective of automatic assessment tasks is to score movements, the automatic generation of feedback highlighting key movement issues has the potential to significantly enhance and accelerate the rehabilitation process. While numerous research works exist in the field of automatic movement assessment, only a handful address feedback generation. In this study, we explain the types of feedback that can be generated, review existing solutions for automatic feedback generation, and discuss future research directions. To our knowledge, this is the first comprehensive review of feedback generation in skeletal movement assessment.
- [1367] arXiv:2404.09366 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Understanding the Role of Temperature in Diverse Question Generation by GPT-4Arav Agarwal , Karthik Mittal , Aidan Doyle , Pragnya Sridhar , Zipiao Wan , Jacob Arthur Doughty , Jaromir Savelka , Majd SakrSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We conduct a preliminary study of the effect of GPT's temperature parameter on the diversity of GPT4-generated questions. We find that using higher temperature values leads to significantly higher diversity, with different temperatures exposing different types of similarity between generated sets of questions. We also demonstrate that diverse question generation is especially difficult for questions targeting lower levels of Bloom's Taxonomy.
- [1368] arXiv:2404.09384 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Tasks People Prompt: A Taxonomy of LLM Downstream Tasks in Software Verification and Falsification ApproachesVíctor A. Braberman , Flavia Bonomo-Braberman , Yiannis Charalambous , Juan G. Colonna , Lucas C. Cordeiro , Rosiane de FreitasSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Prompting has become one of the main approaches to leverage emergent capabilities of Large Language Models [Brown et al. NeurIPS 2020, Wei et al. TMLR 2022, Wei et al. NeurIPS 2022]. During the last year, researchers and practitioners have been playing with prompts to see how to make the most of LLMs. By homogeneously dissecting 80 papers, we investigate in deep how software testing and verification research communities have been abstractly architecting their LLM-enabled solutions. More precisely, first, we want to validate whether downstream tasks are an adequate concept to convey the blueprint of prompt-based solutions. We also aim at identifying number and nature of such tasks in solutions. For such goal, we develop a novel downstream task taxonomy that enables pinpointing some engineering patterns in a rather varied spectrum of Software Engineering problems that encompasses testing, fuzzing, debugging, vulnerability detection, static analysis and program verification approaches.
- [1369] arXiv:2404.09387 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: RankCLIP: Ranking-Consistent Language-Image PretrainingComments: 10 pages, 3 figures, 6 tables. Code and model checkpoints are available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Among the ever-evolving development of vision-language models, contrastive language-image pretraining (CLIP) has set new benchmarks in many downstream tasks such as zero-shot classifications by leveraging self-supervised contrastive learning on large amounts of text-image pairs. However, its dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RankCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the enhanced capability of RankCLIP to effectively improve performance across various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art methods, underscoring the potential of RankCLIP in further advancing vision-language pretraining.
- [1370] arXiv:2404.09391 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Privacy at a Price: Exploring its Dual Impact on AI FairnessSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Abstract: The worldwide adoption of machine learning (ML) and deep learning models, particularly in critical sectors, such as healthcare and finance, presents substantial challenges in maintaining individual privacy and fairness. These two elements are vital to a trustworthy environment for learning systems. While numerous studies have concentrated on protecting individual privacy through differential privacy (DP) mechanisms, emerging research indicates that differential privacy in machine learning models can unequally impact separate demographic subgroups regarding prediction accuracy. This leads to a fairness concern, and manifests as biased performance. Although the prevailing view is that enhancing privacy intensifies fairness disparities, a smaller, yet significant, subset of research suggests the opposite view. In this article, with extensive evaluation results, we demonstrate that the impact of differential privacy on fairness is not monotonous. Instead, we observe that the accuracy disparity initially grows as more DP noise (enhanced privacy) is added to the ML process, but subsequently diminishes at higher privacy levels with even more noise. Moreover, implementing gradient clipping in the differentially private stochastic gradient descent ML method can mitigate the negative impact of DP noise on fairness. This mitigation is achieved by moderating the disparity growth through a lower clipping threshold.
- [1371] arXiv:2404.09401 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion ModelsComments: updated referencesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Diffusion Models (DMs) have shown remarkable capabilities in various image-generation tasks. However, there are growing concerns that DMs could be used to imitate unauthorized creations and thus raise copyright issues. To address this issue, we propose a novel framework that embeds personal watermarks in the generation of adversarial examples. Such examples can force DMs to generate images with visible watermarks and prevent DMs from imitating unauthorized images. We construct a generator based on conditional adversarial networks and design three losses (adversarial loss, GAN loss, and perturbation loss) to generate adversarial examples that have subtle perturbation but can effectively attack DMs to prevent copyright violations. Training a generator for a personal watermark by our method only requires 5-10 samples within 2-3 minutes, and once the generator is trained, it can generate adversarial examples with that watermark significantly fast (0.2s per image). We conduct extensive experiments in various conditional image-generation scenarios. Compared to existing methods that generate images with chaotic textures, our method adds visible watermarks on the generated images, which is a more straightforward way to indicate copyright violations. We also observe that our adversarial examples exhibit good transferability across unknown generative models. Therefore, this work provides a simple yet powerful way to protect copyright from DM-based imitation.
- [1372] arXiv:2404.09402 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Neural McKean-Vlasov Processes: Distributional Dependence in Diffusion ProcessesComments: Appears in AISTATS 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: McKean-Vlasov stochastic differential equations (MV-SDEs) provide a mathematical description of the behavior of an infinite number of interacting particles by imposing a dependence on the particle density. As such, we study the influence of explicitly including distributional information in the parameterization of the SDE. We propose a series of semi-parametric methods for representing MV-SDEs, and corresponding estimators for inferring parameters from data based on the properties of the MV-SDE. We analyze the characteristics of the different architectures and estimators, and consider their applicability in relevant machine learning problems. We empirically compare the performance of the different architectures and estimators on real and synthetic datasets for time series and probabilistic modeling. The results suggest that explicitly including distributional dependence in the parameterization of the SDE is effective in modeling temporal data with interaction under an exchangeability assumption while maintaining strong performance for standard Itô-SDEs due to the richer class of probability flows associated with MV-SDEs.
- [1373] arXiv:2404.09405 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Few-shot Name Entity Recognition on StackOverflowComments: 5 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: StackOverflow, with its vast question repository and limited labeled examples, raise an annotation challenge for us. We address this gap by proposing RoBERTa+MAML, a few-shot named entity recognition (NER) method leveraging meta-learning. Our approach, evaluated on the StackOverflow NER corpus (27 entity types), achieves a 5% F1 score improvement over the baseline. We improved the results further domain-specific phrase processing enhance results.
- [1374] arXiv:2404.09432 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: The 8th AI City ChallengeShuo Wang , David C. Anastasiu , Zheng Tang , Ming-Ching Chang , Yue Yao , Liang Zheng , Mohammed Shaiqur Rahman , Meenakshi S. Arya , Anuj Sharma , Pranamesh Chakraborty , Sanjita Prajapati , Quan Kong , Norimasa Kobori , Munkhjargal Gochoo , Munkh-Erdene Otgonbold , Fady Alnajjar , Ganzorig Batnasan , Ping-Yang Chen , Jun-Wei Hsieh , Xunlei Wu , Sameer Satish Pusegaonkar , Yizhou Wang , Sujit Biswas , Rama ChellappaComments: Summary of the 8th AI City Challenge Workshop in conjunction with CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The eighth AI City Challenge highlighted the convergence of computer vision and artificial intelligence in areas like retail, warehouse settings, and Intelligent Traffic Systems (ITS), presenting significant research opportunities. The 2024 edition featured five tracks, attracting unprecedented interest from 726 teams in 47 countries and regions. Track 1 dealt with multi-target multi-camera (MTMC) people tracking, highlighting significant enhancements in camera count, character number, 3D annotation, and camera matrices, alongside new rules for 3D tracking and online tracking algorithm encouragement. Track 2 introduced dense video captioning for traffic safety, focusing on pedestrian accidents using multi-camera feeds to improve insights for insurance and prevention. Track 3 required teams to classify driver actions in a naturalistic driving analysis. Track 4 explored fish-eye camera analytics using the FishEye8K dataset. Track 5 focused on motorcycle helmet rule violation detection. The challenge utilized two leaderboards to showcase methods, with participants setting new benchmarks, some surpassing existing state-of-the-art achievements.
- [1375] arXiv:2404.09445 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Exploring Text-to-Motion Generation with Human PreferenceJenny Sheng , Matthieu Lin , Andrew Zhao , Kevin Pruvost , Yu-Hui Wen , Yangguang Li , Gao Huang , Yong-Jin LiuComments: Accepted to CVPR 2024 HuMoGen WorkshopSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: This paper presents an exploration of preference learning in text-to-motion generation. We find that current improvements in text-to-motion generation still rely on datasets requiring expert labelers with motion capture systems. Instead, learning from human preference data does not require motion capture systems; a labeler with no expertise simply compares two generated motions. This is particularly efficient because evaluating the model's output is easier than gathering the motion that performs a desired task (e.g. backflip). To pioneer the exploration of this paradigm, we annotate 3,528 preference pairs generated by MotionGPT, marking the first effort to investigate various algorithms for learning from preference data. In particular, our exploration highlights important design choices when using preference data. Additionally, our experimental results show that preference learning has the potential to greatly improve current text-to-motion generative models. Our code and dataset are publicly available at this https URL }{ this https URL to further facilitate research in this area.
- [1376] arXiv:2404.09462 (cross-list from q-fin.CP) [ pdf , ps , html , other ]
-
Title: Experimental Analysis of Deep Hedging Using Artificial Market Simulations for Underlying Asset SimulatorsComments: 9 pagesSubjects: Computational Finance (q-fin.CP) ; Artificial Intelligence (cs.AI)
Abstract: Derivative hedging and pricing are important and continuously studied topics in financial markets. Recently, deep hedging has been proposed as a promising approach that uses deep learning to approximate the optimal hedging strategy and can handle incomplete markets. However, deep hedging usually requires underlying asset simulations, and it is challenging to select the best model for such simulations. This study proposes a new approach using artificial market simulations for underlying asset simulations in deep hedging. Artificial market simulations can replicate the stylized facts of financial markets, and they seem to be a promising approach for deep hedging. We investigate the effectiveness of the proposed approach by comparing its results with those of the traditional approach, which uses mathematical finance models such as Brownian motion and Heston models for underlying asset simulations. The results show that the proposed approach can achieve almost the same level of performance as the traditional approach without mathematical finance models. Finally, we also reveal that the proposed approach has some limitations in terms of performance under certain conditions.
- [1377] arXiv:2404.09465 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AIComments: Accepted by CVPR 2024, 18 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: With recent developments in Embodied Artificial Intelligence (EAI) research, there has been a growing demand for high-quality, large-scale interactive scene generation. While prior methods in scene synthesis have prioritized the naturalness and realism of the generated scenes, the physical plausibility and interactivity of scenes have been largely left unexplored. To address this disparity, we introduce PhyScene, a novel method dedicated to generating interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents. Based on a conditional diffusion model for capturing scene layouts, we devise novel physics- and interactivity-based guidance mechanisms that integrate constraints from object collision, room layout, and object reachability. Through extensive experiments, we demonstrate that PhyScene effectively leverages these guidance functions for physically interactable scene synthesis, outperforming existing state-of-the-art scene synthesis methods by a large margin. Our findings suggest that the scenes generated by PhyScene hold considerable potential for facilitating diverse skill acquisition among agents within interactive environments, thereby catalyzing further advancements in embodied AI research. Project website: this http URL .
- [1378] arXiv:2404.09470 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: LatticeML: A data-driven application for predicting the effective Young Modulus of high temperature graph based architected materialsComments: 32 pages, 11 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Optimization and Control (math.OC); Applied Physics (physics.app-ph)
Abstract: Architected materials with their unique topology and geometry offer the potential to modify physical and mechanical properties. Machine learning can accelerate the design and optimization of these materials by identifying optimal designs and forecasting performance. This work presents LatticeML, a data-driven application for predicting the effective Young's Modulus of high-temperature graph-based architected materials. The study considers eleven graph-based lattice structures with two high-temperature alloys, Ti-6Al-4V and Inconel 625. Finite element simulations were used to compute the effective Young's Modulus of the 2x2x2 unit cell configurations. A machine learning framework was developed to predict Young's Modulus, involving data collection, preprocessing, implementation of regression models, and deployment of the best-performing model. Five supervised learning algorithms were evaluated, with the XGBoost Regressor achieving the highest accuracy (MSE = 2.7993, MAE = 1.1521, R-squared = 0.9875). The application uses the Streamlit framework to create an interactive web interface, allowing users to input material and geometric parameters and obtain predicted Young's Modulus values.
- [1379] arXiv:2404.09475 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Improving Weakly-Supervised Object Localization Using Adversarial Erasing and Pseudo LabelComments: 15 pagesJournal-ref: Engineering Applications of Artificial Intelligence, 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Weakly-supervised learning approaches have gained significant attention due to their ability to reduce the effort required for human annotations in training neural networks. This paper investigates a framework for weakly-supervised object localization, which aims to train a neural network capable of predicting both the object class and its location using only images and their image-level class labels. The proposed framework consists of a shared feature extractor, a classifier, and a localizer. The localizer predicts pixel-level class probabilities, while the classifier predicts the object class at the image level. Since image-level class labels are insufficient for training the localizer, weakly-supervised object localization methods often encounter challenges in accurately localizing the entire object region. To address this issue, the proposed method incorporates adversarial erasing and pseudo labels to improve localization accuracy. Specifically, novel losses are designed to utilize adversarially erased foreground features and adversarially erased feature maps, reducing dependence on the most discriminative region. Additionally, the proposed method employs pseudo labels to suppress activation values in the background while increasing them in the foreground. The proposed method is applied to two backbone networks (MobileNetV1 and InceptionV3) and is evaluated on three publicly available datasets (ILSVRC-2012, CUB-200-2011, and PASCAL VOC 2012). The experimental results demonstrate that the proposed method outperforms previous state-of-the-art methods across all evaluated metrics.
- [1380] arXiv:2404.09480 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Mitigating Hallucination in Abstractive Summarization with Domain-Conditional Mutual InformationComments: Accepted by Findings of NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: A primary challenge in abstractive summarization is hallucination -- the phenomenon where a model generates plausible text that is absent in the source text. We hypothesize that the domain (or topic) of the source text triggers the model to generate text that is highly probable in the domain, neglecting the details of the source text. To alleviate this model bias, we introduce a decoding strategy based on domain-conditional pointwise mutual information. This strategy adjusts the generation probability of each token by comparing it with the token's marginal probability within the domain of the source text. According to evaluation on the XSUM dataset, our method demonstrates improvement in terms of faithfulness and source relevance. The code is publicly available at \url{ this https URL }.
- [1381] arXiv:2404.09516 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: State Space Model for New-Generation Network Alternative to Transformers: A SurveyXiao Wang , Shiao Wang , Yuhe Ding , Yuehang Li , Wentao Wu , Yao Rong , Weizhe Kong , Ju Huang , Shihao Li , Haoxiang Yang , Ziwen Wang , Bo Jiang , Chenglong Li , Yaowei Wang , Yonghong Tian , Jin TangComments: The First review of State Space Model (SSM)/Mamba and their applications in artificial intelligence, 33 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Abstract: In the post-deep learning era, the Transformer architecture has demonstrated its powerful performance across pre-trained big models and various downstream tasks. However, the enormous computational demands of this architecture have deterred many researchers. To further reduce the complexity of attention models, numerous efforts have been made to design more efficient methods. Among them, the State Space Model (SSM), as a possible replacement for the self-attention based Transformer model, has drawn more and more attention in recent years. In this paper, we give the first comprehensive review of these works and also provide experimental comparisons and analysis to better demonstrate the features and advantages of SSM. Specifically, we first give a detailed description of principles to help the readers quickly capture the key ideas of SSM. After that, we dive into the reviews of existing SSMs and their various applications, including natural language processing, computer vision, graph, multi-modal and multi-media, point cloud/event stream, time series data, and other domains. In addition, we give statistical comparisons and analysis of these models and hope it helps the readers to understand the effectiveness of different structures on various tasks. Then, we propose possible research points in this direction to better promote the development of the theoretical model and application of SSM. More related works will be continuously updated on the following GitHub: this https URL .
- [1382] arXiv:2404.09521 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Inferring Behavior-Specific Context Improves Zero-Shot Generalization in Reinforcement LearningComments: this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In this work, we address the challenge of zero-shot generalization (ZSG) in Reinforcement Learning (RL), where agents must adapt to entirely novel environments without additional training. We argue that understanding and utilizing contextual cues, such as the gravity level of the environment, is critical for robust generalization, and we propose to integrate the learning of context representations directly with policy learning. Our algorithm demonstrates improved generalization on various simulated domains, outperforming prior context-learning techniques in zero-shot settings. By jointly learning policy and context, our method acquires behavior-specific context representations, enabling adaptation to unseen environments and marks progress towards reinforcement learning systems that generalize across diverse real-world tasks. Our code and experiments are available at this https URL .
- [1383] arXiv:2404.09529 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language ModelsComments: 18 pages, code in this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: During inference for transformer-based large language models (LLM), prefilling is the computation of the key-value (KV) cache for input tokens in the prompt prior to autoregressive generation. For longer input prompt lengths, prefilling will incur a significant overhead on decoding time. In this work, we highlight the following pitfall of prefilling: for batches containing high-varying prompt lengths, significant computation is wasted by the standard practice of padding sequences to the maximum length. As LLMs increasingly support longer context lengths, potentially up to 10 million tokens, variations in prompt lengths within a batch become more pronounced. To address this, we propose Prepacking, a simple yet effective method to optimize prefilling computation. To avoid redundant computation on pad tokens, prepacking combines prompts of varying lengths into a sequence and packs multiple sequences into a compact batch using a bin-packing algorithm. It then modifies the attention mask and positional encoding to compute multiple prefilled KV-caches for multiple prompts within a single sequence. On standard curated dataset containing prompts with varying lengths, we obtain a significant speed and memory efficiency improvements as compared to the default padding-based prefilling computation within Huggingface across a range of base model configurations and inference serving scenarios.
- [1384] arXiv:2404.09530 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and GeneralizationAvinash Anand , Raj Jaiswal , Mohit Gupta , Siddhesh S Bangar , Pijush Bhuyan , Naman Lal , Rajeev Singh , Ritika Jha , Rajiv Ratn Shah , Shin'ichi SatohComments: 8 pages, 6 figures, MMAsia 2023 Proceedings of the 5th ACM International Conference on Multimedia in AsiaJournal-ref: In Proceedings of the 5th ACM International Conference on Multimedia in Asia 2023. Association for Computing Machinery, NY, USA, Article 74, pp. 1-6Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Large ground-truth datasets and recent advances in deep learning techniques have been useful for layout detection. However, because of the restricted layout diversity of these datasets, training on them requires a sizable number of annotated instances, which is both expensive and time-consuming. As a result, differences between the source and target domains may significantly impact how well these models function. To solve this problem, domain adaptation approaches have been developed that use a small quantity of labeled data to adjust the model to the target domain. In this research, we introduced a synthetic document dataset called RanLayNet, enriched with automatically assigned labels denoting spatial positions, ranges, and types of layout elements. The primary aim of this endeavor is to develop a versatile dataset capable of training models with robustness and adaptability to diverse document formats. Through empirical experimentation, we demonstrate that a deep layout identification model trained on our dataset exhibits enhanced performance compared to a model trained solely on actual documents. Moreover, we conduct a comparative analysis by fine-tuning inference models using both PubLayNet and IIIT-AR-13K datasets on the Doclaynet dataset. Our findings emphasize that models enriched with our dataset are optimal for tasks such as achieving 0.398 and 0.588 mAP95 score in the scientific document domain for the TABLE class.
- [1385] arXiv:2404.09533 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: WiTUnet: A U-Shaped Architecture Integrating CNN and Transformer for Improved Feature Alignment and Local Information FusionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Low-dose computed tomography (LDCT) has become the technology of choice for diagnostic medical imaging, given its lower radiation dose compared to standard CT, despite increasing image noise and potentially affecting diagnostic accuracy. To address this, advanced deep learning-based LDCT denoising algorithms have been developed, primarily using Convolutional Neural Networks (CNNs) or Transformer Networks with the Unet architecture. This architecture enhances image detail by integrating feature maps from the encoder and decoder via skip connections. However, current methods often overlook enhancements to the Unet architecture itself, focusing instead on optimizing encoder and decoder structures. This approach can be problematic due to the significant differences in feature map characteristics between the encoder and decoder, where simple fusion strategies may not effectively reconstruct this http URL this paper, we introduce WiTUnet, a novel LDCT image denoising method that utilizes nested, dense skip pathways instead of traditional skip connections to improve feature integration. WiTUnet also incorporates a windowed Transformer structure to process images in smaller, non-overlapping segments, reducing computational load. Additionally, the integration of a Local Image Perception Enhancement (LiPe) module in both the encoder and decoder replaces the standard multi-layer perceptron (MLP) in Transformers, enhancing local feature capture and representation. Through extensive experimental comparisons, WiTUnet has demonstrated superior performance over existing methods in key metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Root Mean Square Error (RMSE), significantly improving noise removal and image quality.
- [1386] arXiv:2404.09536 (cross-list from cs.DC) [ pdf , ps , html , other ]
-
Title: Beyond Noise: Privacy-Preserving Decentralized Learning with Virtual NodesSayan Biswas , Mathieu Even , Anne-Marie Kermarrec , Laurent Massoulie , Rafael Pires , Rishi Sharma , Martijn de VosSubjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract: Decentralized learning (DL) enables collaborative learning without a server and without training data leaving the users' devices. However, the models shared in DL can still be used to infer training data. Conventional privacy defenses such as differential privacy and secure aggregation fall short in effectively safeguarding user privacy in DL. We introduce Shatter, a novel DL approach in which nodes create virtual nodes (VNs) to disseminate chunks of their full model on their behalf. This enhances privacy by (i) preventing attackers from collecting full models from other nodes, and (ii) hiding the identity of the original node that produced a given model chunk. We theoretically prove the convergence of Shatter and provide a formal analysis demonstrating how Shatter reduces the efficacy of attacks compared to when exchanging full models between participating nodes. We evaluate the convergence and attack resilience of Shatter with existing DL algorithms, with heterogeneous datasets, and against three standard privacy attacks, including gradient inversion. Our evaluation shows that Shatter not only renders these privacy attacks infeasible when each node operates 16 VNs but also exhibits a positive impact on model convergence compared to standard DL. This enhanced privacy comes with a manageable increase in communication volume.
- [1387] arXiv:2404.09537 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Machine Learning Techniques for Python Source Code Vulnerability DetectionSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Software vulnerabilities are a fundamental reason for the prevalence of cyber attacks and their identification is a crucial yet challenging problem in cyber security. In this paper, we apply and compare different machine learning algorithms for source code vulnerability detection specifically for Python programming language. Our experimental evaluation demonstrates that our Bidirectional Long Short-Term Memory (BiLSTM) model achieves a remarkable performance (average Accuracy = 98.6%, average F-Score = 94.7%, average Precision = 96.2%, average Recall = 93.3%, average ROC = 99.3%), thereby, establishing a new benchmark for vulnerability detection in Python source code.
- [1388] arXiv:2404.09544 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: GNNavigator: Towards Adaptive Training of Graph Neural Networks via Automatic Guideline ExplorationComments: Accepted by DAC'24Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graph Neural Networks (GNNs) succeed significantly in many applications recently. However, balancing GNNs training runtime cost, memory consumption, and attainable accuracy for various applications is non-trivial. Previous training methodologies suffer from inferior adaptability and lack a unified training optimization solution. To address the problem, this work proposes GNNavigator, an adaptive GNN training configuration optimization framework. GNNavigator meets diverse GNN application requirements due to our unified software-hardware co-abstraction, proposed GNNs training performance model, and practical design space exploration solution. Experimental results show that GNNavigator can achieve up to 3.1x speedup and 44.9% peak memory reduction with comparable accuracy to state-of-the-art approaches.
- [1389] arXiv:2404.09557 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: Characterization and Mitigation of Insufficiencies in Automated Driving SystemsYuting Fu , Jochen Seemann , Caspar Hanselaar , Tim Beurskens , Andrei Terechko , Emilia Silvas , Maurice HeemelsComments: Published at the 27th International Technical Conference on the Enhanced Safety of Vehicles (ESV), Apr 2023, Yokohama, Japan. Original publication this https URLSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Abstract: Automated Driving (AD) systems have the potential to increase safety, comfort and energy efficiency. Recently, major automotive companies have started testing and validating AD systems (ADS) on public roads. Nevertheless, the commercial deployment and wide adoption of ADS have been moderate, partially due to system functional insufficiencies (FI) that undermine passenger safety and lead to hazardous situations on the road. FIs are defined in ISO 21448 Safety Of The Intended Functionality (SOTIF). FIs are insufficiencies in sensors, actuators and algorithm implementations, including neural networks and probabilistic calculations. Examples of FIs in ADS include inaccurate ego-vehicle localization on the road, incorrect prediction of a cyclist maneuver, unreliable detection of a pedestrian, etc.
The main goal of our study is to formulate a generic architectural design pattern, which is compatible with existing methods and ADS, to improve FI mitigation and enable faster commercial deployment of ADS. First, we studied the 2021 autonomous vehicles disengagement reports published by the California Department of Motor Vehicles (DMV). The data clearly show that disengagements are five times more often caused by FIs rather than by system faults. We then made a comprehensive list of insufficiencies and their characteristics by analyzing over 10 hours of publicly available road test videos. In particular, we identified insufficiency types in four major categories: world model, motion plan, traffic rule, and operational design domain. The insufficiency characterization helps making the SOTIF analyses of triggering conditions more systematic and comprehensive.
Based on our FI characterization, simulation experiments and literature survey, we define a novel generic architectural design pattern Daruma to dynamically select the channel that is least likely to have a FI at the moment. - [1390] arXiv:2404.09559 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Joint Contrastive Learning with Feature Alignment for Cross-Corpus EEG-based Emotion RecognitionSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: The integration of human emotions into multimedia applications shows great potential for enriching user experiences and enhancing engagement across various digital platforms. Unlike traditional methods such as questionnaires, facial expressions, and voice analysis, brain signals offer a more direct and objective understanding of emotional states. However, in the field of electroencephalography (EEG)-based emotion recognition, previous studies have primarily concentrated on training and testing EEG models within a single dataset, overlooking the variability across different datasets. This oversight leads to significant performance degradation when applying EEG models to cross-corpus scenarios. In this study, we propose a novel Joint Contrastive learning framework with Feature Alignment (JCFA) to address cross-corpus EEG-based emotion recognition. The JCFA model operates in two main stages. In the pre-training stage, a joint domain contrastive learning strategy is introduced to characterize generalizable time-frequency representations of EEG signals, without the use of labeled data. It extracts robust time-based and frequency-based embeddings for each EEG sample, and then aligns them within a shared latent time-frequency space. In the fine-tuning stage, JCFA is refined in conjunction with downstream tasks, where the structural connections among brain electrodes are considered. The model capability could be further enhanced for the application in emotion detection and interpretation. Extensive experimental results on two well-recognized emotional datasets show that the proposed JCFA model achieves state-of-the-art (SOTA) performance, outperforming the second-best method by an average accuracy increase of 4.09% in cross-corpus EEG-based emotion recognition tasks.
- [1391] arXiv:2404.09562 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: {\sigma}-GPTs: A New Approach to Autoregressive ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Autoregressive models, such as the GPT family, use a fixed order, usually left-to-right, to generate sequences. However, this is not a necessity. In this paper, we challenge this assumption and show that by simply adding a positional encoding for the output, this order can be modulated on-the-fly per-sample which offers key advantageous properties. It allows for the sampling of and conditioning on arbitrary subsets of tokens, and it also allows sampling in one shot multiple tokens dynamically according to a rejection strategy, leading to a sub-linear number of model evaluations. We evaluate our method across various domains, including language modeling, path-solving, and aircraft vertical rate prediction, decreasing the number of steps required for generation by an order of magnitude.
- [1392] arXiv:2404.09565 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Reliability Estimation of News Media Sources: Birds of a Feather Flock TogetherComments: Accepted to NAACL 2024 Main ConferenceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Evaluating the reliability of news sources is a routine task for journalists and organizations committed to acquiring and disseminating accurate information. Recent research has shown that predicting sources' reliability represents an important first-prior step in addressing additional challenges such as fake news detection and fact-checking. In this paper, we introduce a novel approach for source reliability estimation that leverages reinforcement learning strategies for estimating the reliability degree of news sources. Contrary to previous research, our proposed approach models the problem as the estimation of a reliability degree, and not a reliability label, based on how all the news media sources interact with each other on the Web. We validated the effectiveness of our method on a news media reliability dataset that is an order of magnitude larger than comparable existing datasets. Results show that the estimated reliability degrees strongly correlates with journalists-provided scores (Spearman=0.80) and can effectively predict reliability labels (macro-avg. F$_1$ score=81.05). We release our implementation and dataset, aiming to provide a valuable resource for the NLP community working on information verification.
- [1393] arXiv:2404.09574 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Predicting and Analyzing Pedestrian Crossing Behavior at Unsignalized CrossingsChi Zhang (1), Janis Sprenger (2), Zhongjun Ni (3), Christian Berger (1) ((1) Department of Computer Science and Engineering, University of Gothenburg, Sweden, (2) German Research Center for Artificial Intelligence (DFKI), Saarland Informatics Campus, Germany, (3) Department of Science and Technology, Linköping University, Campus Norrköping, Sweden)Comments: 8 pages, 10 figures, 4 tables. Accepted in 2024 IEEE Intelligent Vehicles Symposium (IV)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Understanding and predicting pedestrian crossing behavior is essential for enhancing automated driving and improving driving safety. Predicting gap selection behavior and the use of zebra crossing enables driving systems to proactively respond and prevent potential conflicts. This task is particularly challenging at unsignalized crossings due to the ambiguous right of way, requiring pedestrians to constantly interact with vehicles and other pedestrians. This study addresses these challenges by utilizing simulator data to investigate scenarios involving multiple vehicles and pedestrians. We propose and evaluate machine learning models to predict gap selection in non-zebra scenarios and zebra crossing usage in zebra scenarios. We investigate and discuss how pedestrians' behaviors are influenced by various factors, including pedestrian waiting time, walking speed, the number of unused gaps, the largest missed gap, and the influence of other pedestrians. This research contributes to the evolution of intelligent vehicles by providing predictive models and valuable insights into pedestrian crossing behavior.
- [1394] arXiv:2404.09576 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Large language models and linguistic intentionalitySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Do large language models like Chat-GPT or LLaMa meaningfully use the words they produce? Or are they merely clever prediction machines, simulating language use by producing statistically plausible text? There have already been some initial attempts to answer this question by showing that these models meet the criteria for entering meaningful states according to metasemantic theories of mental content. In this paper, I will argue for a different approach - that we should instead consider whether language models meet the criteria given by our best metasemantic theories of linguistic content. In that vein, I will illustrate how this can be done by applying two such theories to the case of language models: Gareth Evans' (1982) account of naming practices and Ruth Millikan's (1984, 2004, 2005) teleosemantics. In doing so, I will argue that it is a mistake to think that the failure of LLMs to meet plausible conditions for mental intentionality thereby renders their outputs meaningless, and that a distinguishing feature of linguistic intentionality - dependency on a pre-existing linguistic system - allows for the plausible result LLM outputs are meaningful.
- [1395] arXiv:2404.09577 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Transformers, Contextualism, and PolysemySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The transformer architecture, introduced by Vaswani et al. (2017), is at the heart of the remarkable recent progress in the development of language models, including famous chatbots such as Chat-gpt and Bard. In this paper, I argue that we an extract from the way the transformer architecture works a picture of the relationship between context and meaning. I call this the transformer picture, and I argue that it is a novel with regard to two related philosophical debates: the contextualism debate regarding the extent of context-sensitivity across natural language, and the polysemy debate regarding how polysemy should be captured within an account of word meaning. Although much of the paper merely tries to position the transformer picture with respect to these two debates, I will also begin to make the case for the transformer picture.
- [1396] arXiv:2404.09579 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Modelling LanguageSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper argues that large language models have a valuable scientific role to play in serving as scientific models of a language. Linguistic study should not only be concerned with the cognitive processes behind linguistic competence, but also with language understood as an external, social entity. Once this is recognized, the value of large language models as scientific models becomes clear. This paper defends this position against a number of arguments to the effect that language models provide no linguistic insight. It also draws upon recent work in philosophy of science to show how large language models could serve as scientific models.
- [1397] arXiv:2404.09595 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Building Semantic Communication System via Molecules: An End-to-End Training ApproachSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI)
Abstract: The concept of semantic communication provides a novel approach for applications in scenarios with limited communication resources. In this paper, we propose an end-to-end (E2E) semantic molecular communication system, aiming to enhance the efficiency of molecular communication systems by reducing the transmitted information. Specifically, following the joint source channel coding paradigm, the network is designed to encode the task-relevant information into the concentration of the information molecules, which is robust to the degradation of the molecular communication channel. Furthermore, we propose a channel network to enable the E2E learning over the non-differentiable molecular channel. Experimental results demonstrate the superior performance of the semantic molecular communication system over the conventional methods in classification tasks.
- [1398] arXiv:2404.09601 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Reactive Model Correction: Mitigating Harm to Task-Relevant Features via Conditional Bias SuppressionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Deep Neural Networks are prone to learning and relying on spurious correlations in the training data, which, for high-risk applications, can have fatal consequences. Various approaches to suppress model reliance on harmful features have been proposed that can be applied post-hoc without additional training. Whereas those methods can be applied with efficiency, they also tend to harm model performance by globally shifting the distribution of latent features. To mitigate unintended overcorrection of model behavior, we propose a reactive approach conditioned on model-derived knowledge and eXplainable Artificial Intelligence (XAI) insights. While the reactive approach can be applied to many post-hoc methods, we demonstrate the incorporation of reactivity in particular for P-ClArC (Projective Class Artifact Compensation), introducing a new method called R-ClArC (Reactive Class Artifact Compensation). Through rigorous experiments in controlled settings (FunnyBirds) and with a real-world dataset (ISIC2019), we show that introducing reactivity can minimize the detrimental effect of the applied correction while simultaneously ensuring low reliance on spurious features.
- [1399] arXiv:2404.09606 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Self-feedback Knowledge Elicitation Approach for Chemical Reaction PredictionsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Abstract: The task of chemical reaction predictions (CRPs) plays a pivotal role in advancing drug discovery and material science. However, its effectiveness is constrained by the vast and uncertain chemical reaction space and challenges in capturing reaction selectivity, particularly due to existing methods' limitations in exploiting the data's inherent knowledge. To address these challenges, we introduce a data-curated self-feedback knowledge elicitation approach. This method starts from iterative optimization of molecular representations and facilitates the extraction of knowledge on chemical reaction types (RTs). Then, we employ adaptive prompt learning to infuse the prior knowledge into the large language model (LLM). As a result, we achieve significant enhancements: a 14.2% increase in retrosynthesis prediction accuracy, a 74.2% rise in reagent prediction accuracy, and an expansion in the model's capability for handling multi-task chemical reactions. This research offers a novel paradigm for knowledge elicitation in scientific research and showcases the untapped potential of LLMs in CRPs.
- [1400] arXiv:2404.09610 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: LoRA Dropout as a Sparsity Regularizer for Overfitting ControlSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Parameter-efficient fine-tuning methods, represented by LoRA, play an essential role in adapting large-scale pre-trained models to downstream tasks. However, fine-tuning LoRA-series models also faces the risk of overfitting on the training dataset, and yet there's still a lack of theoretical guidance and practical mechanism to control overfitting on LoRA-based PEFT methods. In this paper, we propose a LoRA Dropout mechanism for the LoRA-based methods by introducing random noises to the learnable low-rank matrices and increasing parameter sparsity. We then demonstrate the theoretical mechanism of our LoRA Dropout mechanism from the perspective of sparsity regularization by providing a generalization error bound under this framework. Theoretical results show that appropriate sparsity would help tighten the gap between empirical and generalization risks and thereby control overfitting. Furthermore, based on the LoRA Dropout framework, we introduce a test-time ensemble strategy and provide theoretical evidence demonstrating that the ensemble method can further compress the error bound, and lead to better performance during inference time. Extensive experiments on various NLP tasks provide practical validations of the effectiveness of our LoRA Dropout framework in improving model accuracy and calibration.
- [1401] arXiv:2404.09613 (cross-list from cs.ET) [ pdf , ps , html , other ]
-
Title: Efficient and accurate neural field reconstruction using resistive memoryYifei Yu , Shaocong Wang , Woyu Zhang , Xinyuan Zhang , Xiuzhe Wu , Yangu He , Jichang Yang , Yue Zhang , Ning Lin , Bo Wang , Xi Chen , Songqi Wang , Xumeng Zhang , Xiaojuan Qi , Zhongrui Wang , Dashan Shang , Qi Liu , Kwang-Ting Cheng , Ming LiuSubjects: Emerging Technologies (cs.ET) ; Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Abstract: Human beings construct perception of space by integrating sparse observations into massively interconnected synapses and neurons, offering a superior parallelism and efficiency. Replicating this capability in AI finds wide applications in medical imaging, AR/VR, and embodied AI, where input data is often sparse and computing resources are limited. However, traditional signal reconstruction methods on digital computers face both software and hardware challenges. On the software front, difficulties arise from storage inefficiencies in conventional explicit signal representation. Hardware obstacles include the von Neumann bottleneck, which limits data transfer between the CPU and memory, and the limitations of CMOS circuits in supporting parallel processing. We propose a systematic approach with software-hardware co-optimizations for signal reconstruction from sparse inputs. Software-wise, we employ neural field to implicitly represent signals via neural networks, which is further compressed using low-rank decomposition and structured pruning. Hardware-wise, we design a resistive memory-based computing-in-memory (CIM) platform, featuring a Gaussian Encoder (GE) and an MLP Processing Engine (PE). The GE harnesses the intrinsic stochasticity of resistive memory for efficient input encoding, while the PE achieves precise weight mapping through a Hardware-Aware Quantization (HAQ) circuit. We demonstrate the system's efficacy on a 40nm 256Kb resistive memory-based in-memory computing macro, achieving huge energy efficiency and parallelism improvements without compromising reconstruction quality in tasks like 3D CT sparse reconstruction, novel view synthesis, and novel view synthesis for dynamic scenes. This work advances the AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications.
- [1402] arXiv:2404.09619 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and BenchmarkZhaokun Zhou , Qiulin Wang , Bin Lin , Yiwei Su , Rui Chen , Xin Tao , Amin Zheng , Li Yuan , Pengfei Wan , Di ZhangSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: As an alternative to expensive expert evaluation, Image Aesthetic Assessment (IAA) stands out as a crucial task in computer vision. However, traditional IAA methods are typically constrained to a single data source or task, restricting the universality and broader application. In this work, to better align with human aesthetics, we propose a Unified Multi-modal Image Aesthetic Assessment (UNIAA) framework, including a Multi-modal Large Language Model (MLLM) named UNIAA-LLaVA and a comprehensive benchmark named UNIAA-Bench. We choose MLLMs with both visual perception and language ability for IAA and establish a low-cost paradigm for transforming the existing datasets into unified and high-quality visual instruction tuning data, from which the UNIAA-LLaVA is trained. To further evaluate the IAA capability of MLLMs, we construct the UNIAA-Bench, which consists of three aesthetic levels: Perception, Description, and Assessment. Extensive experiments validate the effectiveness and rationality of UNIAA. UNIAA-LLaVA achieves competitive performance on all levels of UNIAA-Bench, compared with existing MLLMs. Specifically, our model performs better than GPT-4V in aesthetic perception and even approaches the junior-level human. We find MLLMs have great potential in IAA, yet there remains plenty of room for further improvement. The UNIAA-LLaVA and UNIAA-Bench will be released.
- [1403] arXiv:2404.09622 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: DIDLM:A Comprehensive Multi-Sensor Dataset with Infrared Cameras, Depth Cameras, LiDAR, and 4D Millimeter-Wave Radar in Challenging Scenarios for 3D MappingSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: This study presents a comprehensive multi-sensor dataset designed for 3D mapping in challenging indoor and outdoor environments. The dataset comprises data from infrared cameras, depth cameras, LiDAR, and 4D millimeter-wave radar, facilitating exploration of advanced perception and mapping techniques. Integration of diverse sensor data enhances perceptual capabilities in extreme conditions such as rain, snow, and uneven road surfaces. The dataset also includes interactive robot data at different speeds indoors and outdoors, providing a realistic background environment. Slam comparisons between similar routes are conducted, analyzing the influence of different complex scenes on various sensors. Various SLAM algorithms are employed to process the dataset, revealing performance differences among algorithms in different scenarios. In summary, this dataset addresses the problem of data scarcity in special environments, fostering the development of perception and mapping algorithms for extreme conditions. Leveraging multi-sensor data including infrared, depth cameras, LiDAR, 4D millimeter-wave radar, and robot interactions, the dataset advances intelligent mapping and perception capabilities.Our dataset is available at this https URL .
- [1404] arXiv:2404.09625 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Privacy-Preserving Intrusion Detection using Convolutional Neural NetworksComments: Accepted at IEEE Conference on Artificial Intelligence (CAI) 2024Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Privacy-preserving analytics is designed to protect valuable assets. A common service provision involves the input data from the client and the model on the analyst's side. The importance of the privacy preservation is fuelled by legal obligations and intellectual property concerns. We explore the use case of a model owner providing an analytic service on customer's private data. No information about the data shall be revealed to the analyst and no information about the model shall be leaked to the customer. Current methods involve costs: accuracy deterioration and computational complexity. The complexity, in turn, results in a longer processing time, increased requirement on computing resources, and involves data communication between the client and the server. In order to deploy such service architecture, we need to evaluate the optimal setting that fits the constraints. And that is what this paper addresses. In this work, we enhance an attack detection system based on Convolutional Neural Networks with privacy-preserving technology based on PriMIA framework that is initially designed for medical data.
- [1405] arXiv:2404.09636 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: All-in-one simulation-based inferenceSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Amortized Bayesian inference trains neural networks to solve stochastic inference problems using model simulations, thereby making it possible to rapidly perform Bayesian inference for any newly observed data. However, current simulation-based amortized inference methods are simulation-hungry and inflexible: They require the specification of a fixed parametric prior, simulator, and inference tasks ahead of time. Here, we present a new amortized inference method -- the Simformer -- which overcomes these limitations. By training a probabilistic diffusion model with transformer architectures, the Simformer outperforms current state-of-the-art amortized inference approaches on benchmark tasks and is substantially more flexible: It can be applied to models with function-valued parameters, it can handle inference scenarios with missing or unstructured data, and it can sample arbitrary conditionals of the joint distribution of parameters and data, including both posterior and likelihood. We showcase the performance and flexibility of the Simformer on simulators from ecology, epidemiology, and neuroscience, and demonstrate that it opens up new possibilities and application domains for amortized Bayesian inference on simulation-based models.
- [1406] arXiv:2404.09652 (cross-list from cs.LO) [ pdf , ps , html , other ]
-
Title: Monitoring Second-Order HyperpropertiesComments: AAMAS 2024Subjects: Logic in Computer Science (cs.LO) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: Hyperproperties express the relationship between multiple executions of a system. This is needed in many AI-related fields, such as knowledge representation and planning, to capture system properties related to knowledge, information flow, and privacy. In this paper, we study the monitoring of complex hyperproperties at runtime. Previous work in this area has either focused on the simpler problem of monitoring trace properties (which are sets of traces, while hyperproperties are sets of sets of traces) or on monitoring first-order hyperproperties, which are expressible in temporal logics with first-order quantification over traces, such as HyperLTL. We present the first monitoring algorithm for the much more expressive class of second-order hyperproperties. Second-order hyperproperties include system properties like common knowledge, which cannot be expressed in first-order logics like HyperLTL.
We introduce Hyper$^2$LTL$_f$, a temporal logic over finite traces that allows for second-order quantification over sets of traces. We study the monitoring problem in two fundamental execution models: (1) the parallel model, where a fixed number of traces is monitored in parallel, and (2) the sequential model, where an unbounded number of traces is observed sequentially, one trace after the other. For the parallel model, we show that the monitoring of the second-order hyperproperties of Hyper$^2$LTL$_f$ can be reduced to monitoring first-order hyperproperties. For the sequential model, we present a monitoring algorithm that handles second-order quantification efficiently, exploiting optimizations based on the monotonicity of subformulas, graph-based storing of executions, and fixpoint hashing. We present experimental results from a range of benchmarks, including examples from common knowledge and planning. - [1407] arXiv:2404.09682 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data AnnotationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The quality of the dataset is crucial for ensuring optimal performance and reliability of downstream task models. However, datasets often contain noisy data inadvertently included during the construction process. Numerous attempts have been made to correct this issue through human annotators. However, hiring and managing human annotators is expensive and time-consuming. As an alternative, recent studies are exploring the use of large language models (LLMs) for data annotation.
In this study, we present a case study that extends the application of LLM-based data annotation to enhance the quality of existing datasets through a cleansing strategy. Specifically, we leverage approaches such as chain-of-thought (CoT) and majority voting to imitate human annotation and classify unrelated documents from the Multi-News dataset, which is widely used for the multi-document summarization task. Through our proposed cleansing method, we introduce an enhanced Multi-News+. By employing LLMs for data cleansing, we demonstrate an efficient and effective approach to improving dataset quality without relying on expensive human annotation efforts. - [1408] arXiv:2404.09687 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Plus Strategies are Exponentially Slower for Planted Optima of Random HeightSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Probability (math.PR)
Abstract: We compare the $(1,\lambda)$-EA and the $(1 + \lambda)$-EA on the recently introduced benchmark DisOM, which is the OneMax function with randomly planted local optima. Previous work showed that if all local optima have the same relative height, then the plus strategy never loses more than a factor $O(n\log n)$ compared to the comma strategy. Here we show that even small random fluctuations in the heights of the local optima have a devastating effect for the plus strategy and lead to super-polynomial runtimes. On the other hand, due to their ability to escape local optima, comma strategies are unaffected by the height of the local optima and remain efficient. Our results hold for a broad class of possible distortions and show that the plus strategy, but not the comma strategy, is generally deceived by sparse unstructured fluctuations of a smooth landscape.
- [1409] arXiv:2404.09690 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Harnessing GPT-4V(ision) for Insurance: A Preliminary ExplorationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: The emergence of Large Multimodal Models (LMMs) marks a significant milestone in the development of artificial intelligence. Insurance, as a vast and complex discipline, involves a wide variety of data forms in its operational processes, including text, images, and videos, thereby giving rise to diverse multimodal tasks. Despite this, there has been limited systematic exploration of multimodal tasks specific to insurance, nor a thorough investigation into how LMMs can address these challenges. In this paper, we explore GPT-4V's capabilities in the insurance domain. We categorize multimodal tasks by focusing primarily on visual aspects based on types of insurance (e.g., auto, household/commercial property, health, and agricultural insurance) and insurance stages (e.g., risk assessment, risk monitoring, and claims processing). Our experiment reveals that GPT-4V exhibits remarkable abilities in insurance-related tasks, demonstrating not only a robust understanding of multimodal content in the insurance domain but also a comprehensive knowledge of insurance scenarios. However, there are notable shortcomings: GPT-4V struggles with detailed risk rating and loss assessment, suffers from hallucination in image understanding, and shows variable support for different languages. Through this work, we aim to bridge the insurance domain with cutting-edge LMM technology, facilitate interdisciplinary exchange and development, and provide a foundation for the continued advancement and evolution of future research endeavors.
- [1410] arXiv:2404.09695 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language ModelsComments: 8 pages,4 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large language models (LLMs) show excellent performance in difficult tasks, but they often require massive memories and computational resources. How to reduce the parameter scale of LLMs has become research hotspots. In this study, we make an important observation that the multi-head self-attention (MHA) sub-layer of Transformer exhibits noticeable low-rank structure, while the feed-forward network (FFN) sub-layer does not. With this regard, we design a mixed compression model, which organically combines Low-Rank matrix approximation And structured Pruning (LoRAP). For the MHA sub-layer, we propose an input activation weighted singular value decomposition method to strengthen the low-rank characteristic. Furthermore, we discover that the weight matrices in MHA sub-layer have different low-rank degrees. Thus, a novel parameter allocation scheme according to the discrepancy of low-rank degrees is devised. For the FFN sub-layer, we propose a gradient-free structured channel pruning method. During the pruning, we get an interesting finding that the least important 1% of parameter actually play a vital role in model performance. Extensive evaluations on zero-shot perplexity and zero-shot task classification indicate that our proposal is superior to previous structured compression rivals under multiple compression ratios.
- [1411] arXiv:2404.09696 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Are Large Language Models Reliable Argument Quality Annotators?Comments: 18 pages, 5 figures, 5 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Abstract: Evaluating the quality of arguments is a crucial aspect of any system leveraging argument mining. However, it is a challenge to obtain reliable and consistent annotations regarding argument quality, as this usually requires domain-specific expertise of the annotators. Even among experts, the assessment of argument quality is often inconsistent due to the inherent subjectivity of this task. In this paper, we study the potential of using state-of-the-art large language models (LLMs) as proxies for argument quality annotators. To assess the capability of LLMs in this regard, we analyze the agreement between model, human expert, and human novice annotators based on an established taxonomy of argument quality dimensions. Our findings highlight that LLMs can produce consistent annotations, with a moderately high agreement with human experts across most of the quality dimensions. Moreover, we show that using LLMs as additional annotators can significantly improve the agreement between annotators. These results suggest that LLMs can serve as a valuable tool for automated argument quality assessment, thus streamlining and accelerating the evaluation of large argument datasets.
- [1412] arXiv:2404.09707 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Adaptive Patching for High-resolution Image Segmentation with TransformersEnzhi Zhang , Isaac Lyngaas , Peng Chen , Xiao Wang , Jun Igarashi , Yuankai Huo , Mohamed Wahib , Masaharu MunetomoSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Attention-based models are proliferating in the space of image analytics, including segmentation. The standard method of feeding images to transformer encoders is to divide the images into patches and then feed the patches to the model as a linear sequence of tokens. For high-resolution images, e.g. microscopic pathology images, the quadratic compute and memory cost prohibits the use of an attention-based model, if we are to use smaller patch sizes that are favorable in segmentation. The solution is to either use custom complex multi-resolution models or approximate attention schemes. We take inspiration from Adapative Mesh Refinement (AMR) methods in HPC by adaptively patching the images, as a pre-processing step, based on the image details to reduce the number of patches being fed to the model, by orders of magnitude. This method has a negligible overhead, and works seamlessly with any attention-based model, i.e. it is a pre-processing step that can be adopted by any attention-based model without friction. We demonstrate superior segmentation quality over SoTA segmentation models for real-world pathology datasets while gaining a geomean speedup of $6.9\times$ for resolutions up to $64K^2$, on up to $2,048$ GPUs.
- [1413] arXiv:2404.09715 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Higher Replay Ratio Empowers Sample-Efficient Multi-Agent Reinforcement LearningLinjie Xu , Zichuan Liu , Alexander Dockhorn , Diego Perez-Liebana , Jinyu Wang , Lei Song , Jiang BianSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: One of the notorious issues for Reinforcement Learning (RL) is poor sample efficiency. Compared to single agent RL, the sample efficiency for Multi-Agent Reinforcement Learning (MARL) is more challenging because of its inherent partial observability, non-stationary training, and enormous strategy space. Although much effort has been devoted to developing new methods and enhancing sample efficiency, we look at the widely used episodic training mechanism. In each training step, tens of frames are collected, but only one gradient step is made. We argue that this episodic training could be a source of poor sample efficiency. To better exploit the data already collected, we propose to increase the frequency of the gradient updates per environment interaction (a.k.a. Replay Ratio or Update-To-Data ratio). To show its generality, we evaluate $3$ MARL methods on $6$ SMAC tasks. The empirical results validate that a higher replay ratio significantly improves the sample efficiency for MARL algorithms. The codes to reimplement the results presented in this paper are open-sourced at https://anonymous.4open.science/r/rr_for_MARL-0D83/.
- [1414] arXiv:2404.09717 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Unveiling Imitation Learning: Exploring the Impact of Data Falsity to Large Language ModelComments: Under review @ *ACLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Many recent studies endeavor to improve open-source language models through imitation learning, and re-training on the synthetic instruction data from state-of-the-art proprietary models like ChatGPT and GPT-4. However, the innate nature of synthetic data inherently contains noisy data, giving rise to a substantial presence of low-quality data replete with erroneous responses, and flawed reasoning. Although we intuitively grasp the potential harm of noisy data, we lack a quantitative understanding of its impact. To this end, this paper explores the correlation between the degree of noise and its impact on language models through instruction tuning. We first introduce the Falsity-Controllable (FACO) dataset, which comprises pairs of true answers with corresponding reasoning, as well as false pairs to manually control the falsity ratio of the dataset.Through our extensive experiments, we found multiple intriguing findings of the correlation between the factuality of the dataset and instruction tuning: Specifically, we verified falsity of the instruction is highly relevant to various benchmark scores. Moreover, when LLMs are trained with false instructions, they learn to lie and generate fake unfaithful answers, even though they know the correct answer for the user request. Additionally, we noted that once the language model is trained with a dataset contaminated by noise, restoring its original performance is possible, but it failed to reach full performance.
- [1415] arXiv:2404.09722 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: VFLGAN: Vertical Federated Learning-based Generative Adversarial Network for Vertically Partitioned Data PublicationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: In the current artificial intelligence (AI) era, the scale and quality of the dataset play a crucial role in training a high-quality AI model. However, good data is not a free lunch and is always hard to access due to privacy regulations like the General Data Protection Regulation (GDPR). A potential solution is to release a synthetic dataset with a similar distribution to that of the private dataset. Nevertheless, in some scenarios, it has been found that the attributes needed to train an AI model belong to different parties, and they cannot share the raw data for synthetic data publication due to privacy regulations. In PETS 2023, Xue et al. proposed the first generative adversary network-based model, VertiGAN, for vertically partitioned data publication. However, after thoroughly investigating, we found that VertiGAN is less effective in preserving the correlation among the attributes of different parties. This article proposes a Vertical Federated Learning-based Generative Adversarial Network, VFLGAN, for vertically partitioned data publication to address the above issues. Our experimental results show that compared with VertiGAN, VFLGAN significantly improves the quality of synthetic data. Taking the MNIST dataset as an example, the quality of the synthetic dataset generated by VFLGAN is 3.2 times better than that generated by VertiGAN w.r.t. the Fréchet Distance. We also designed a more efficient and effective Gaussian mechanism for the proposed VFLGAN to provide the synthetic dataset with a differential privacy guarantee. On the other hand, differential privacy only gives the upper bound of the worst-case privacy guarantee. This article also proposes a practical auditing scheme that applies membership inference attacks to estimate privacy leakage through the synthetic dataset.
- [1416] arXiv:2404.09738 (cross-list from q-bio.BM) [ pdf , ps , other ]
-
Title: AMPCliff: quantitative definition and benchmarking of activity cliffs in antimicrobial peptidesKewei Li , Yuqian Wu , Yutong Guo , Yinheng Li , Yusi Fan , Ruochi Zhang , Lan Huang , Fengfeng ZhouSubjects: Biomolecules (q-bio.BM) ; Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Abstract: Activity cliff (AC) is a phenomenon that a pair of similar molecules differ by a small structural alternation but exhibit a large difference in their biochemical activities. The AC of small molecules has been extensively investigated but limited knowledge is accumulated about the AC phenomenon in peptides with canonical amino acids. This study introduces a quantitative definition and benchmarking framework AMPCliff for the AC phenomenon in antimicrobial peptides (AMPs) composed by canonical amino acids. A comprehensive analysis of the existing AMP dataset reveals a significant prevalence of AC within AMPs. AMPCliff quantifies the activities of AMPs by the metric minimum inhibitory concentration (MIC), and defines 0.9 as the minimum threshold for the normalized BLOSUM62 similarity score between a pair of aligned peptides with at least two-fold MIC changes. This study establishes a benchmark dataset of paired AMPs in Staphylococcus aureus from the publicly available AMP dataset GRAMPA, and conducts a rigorous procedure to evaluate various AMP AC prediction models, including nine machine learning, four deep learning algorithms, four masked language models, and four generative language models. Our analysis reveals that these models are capable of detecting AMP AC events and the pre-trained protein language ESM2 model demonstrates superior performance across the evaluations. The predictive performance of AMP activity cliffs remains to be further improved, considering that ESM2 with 33 layers only achieves the Spearman correlation coefficient=0.50 for the regression task of the MIC values on the benchmark dataset. Source code and additional resources are available at this https URL or this https URL .
- [1417] arXiv:2404.09752 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Can We Break Free from Strong Data Augmentations in Self-Supervised Learning?Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Self-supervised learning (SSL) has emerged as a promising solution for addressing the challenge of limited labeled data in deep neural networks (DNNs), offering scalability potential. However, the impact of design dependencies within the SSL framework remains insufficiently investigated. In this study, we comprehensively explore SSL behavior across a spectrum of augmentations, revealing their crucial role in shaping SSL model performance and learning mechanisms. Leveraging these insights, we propose a novel learning approach that integrates prior knowledge, with the aim of curtailing the need for extensive data augmentations and thereby amplifying the efficacy of learned representations. Notably, our findings underscore that SSL models imbued with prior knowledge exhibit reduced texture bias, diminished reliance on shortcuts and augmentations, and improved robustness against both natural and adversarial corruptions. These findings not only illuminate a new direction in SSL research, but also pave the way for enhancing DNN performance while concurrently alleviating the imperative for intensive data augmentation, thereby enhancing scalability and real-world problem-solving capabilities.
- [1418] arXiv:2404.09760 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Effective Reinforcement Learning Based on Structural Information PrinciplesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Although Reinforcement Learning (RL) algorithms acquire sequential behavioral patterns through interactions with the environment, their effectiveness in noisy and high-dimensional scenarios typically relies on specific structural priors. In this paper, we propose a novel and general Structural Information principles-based framework for effective Decision-Making, namely SIDM, approached from an information-theoretic perspective. This paper presents a specific unsupervised partitioning method that forms vertex communities in the state and action spaces based on their feature similarities. An aggregation function, which utilizes structural entropy as the vertex weight, is devised within each community to obtain its embedding, thereby facilitating hierarchical state and action abstractions. By extracting abstract elements from historical trajectories, a directed, weighted, homogeneous transition graph is constructed. The minimization of this graph's high-dimensional entropy leads to the generation of an optimal encoding tree. An innovative two-layer skill-based learning mechanism is introduced to compute the common path entropy of each state transition as its identified probability, thereby obviating the requirement for expert knowledge. Moreover, SIDM can be flexibly incorporated into various single-agent and multi-agent RL algorithms, enhancing their performance. Finally, extensive evaluations on challenging benchmarks demonstrate that, compared with SOTA baselines, our framework significantly and consistently improves the policy's quality, stability, and efficiency up to 32.70%, 88.26%, and 64.86%, respectively.
- [1419] arXiv:2404.09763 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: KG-CTG: Citation Generation through Knowledge Graph-guided Large Language ModelsAvinash Anand , Mohit Gupta , Kritarth Prasad , Ujjwal Goel , Naman Lal , Astha Verma , Rajiv Ratn ShahSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Citation Text Generation (CTG) is a task in natural language processing (NLP) that aims to produce text that accurately cites or references a cited document within a source document. In CTG, the generated text draws upon contextual cues from both the source document and the cited paper, ensuring accurate and relevant citation information is provided. Previous work in the field of citation generation is mainly based on the text summarization of documents. Following this, this paper presents a framework, and a comparative study to demonstrate the use of Large Language Models (LLMs) for the task of citation generation. Also, we have shown the improvement in the results of citation generation by incorporating the knowledge graph relations of the papers in the prompt for the LLM to better learn the relationship between the papers. To assess how well our model is performing, we have used a subset of standard S2ORC dataset, which only consists of computer science academic research papers in the English Language. Vicuna performs best for this task with 14.15 Meteor, 12.88 Rouge-1, 1.52 Rouge-2, and 10.94 Rouge-L. Also, Alpaca performs best, and improves the performance by 36.98% in Rouge-1, and 33.14% in Meteor by including knowledge graphs.
- [1420] arXiv:2404.09828 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Interaction as Explanation: A User Interaction-based Method for Explaining Image Classification ModelsComments: 5 pages, 2 figures, 1 tableSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: In computer vision, explainable AI (xAI) methods seek to mitigate the 'black-box' problem by making the decision-making process of deep learning models more interpretable and transparent. Traditional xAI methods concentrate on visualizing input features that influence model predictions, providing insights primarily suited for experts. In this work, we present an interaction-based xAI method that enhances user comprehension of image classification models through their interaction. Thus, we developed a web-based prototype allowing users to modify images via painting and erasing, thereby observing changes in classification results. Our approach enables users to discern critical features influencing the model's decision-making process, aligning their mental models with the model's logic. Experiments conducted with five images demonstrate the potential of the method to reveal feature importance through user interaction. Our work contributes a novel perspective to xAI by centering on end-user engagement and understanding, paving the way for more intuitive and accessible explainability in AI systems.
- [1421] arXiv:2404.09830 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Negation Triplet Extraction with Syntactic Dependency and Semantic ConsistencyComments: Accepted by COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Previous works of negation understanding mainly focus on negation cue detection and scope resolution, without identifying negation subject which is also significant to the downstream tasks. In this paper, we propose a new negation triplet extraction (NTE) task which aims to extract negation subject along with negation cue and scope. To achieve NTE, we devise a novel Syntax&Semantic-Enhanced Negation Extraction model, namely SSENE, which is built based on a generative pretrained language model (PLM) {of Encoder-Decoder architecture} with a multi-task learning framework. Specifically, the given sentence's syntactic dependency tree is incorporated into the PLM's encoder to discover the correlations between the negation subject, cue and scope. Moreover, the semantic consistency between the sentence and the extracted triplet is ensured by an auxiliary task learning. Furthermore, we have constructed a high-quality Chinese dataset NegComment based on the users' reviews from the real-world platform of Meituan, upon which our evaluations show that SSENE achieves the best NTE performance compared to the baselines. Our ablation and case studies also demonstrate that incorporating the syntactic information helps the PLM's recognize the distant dependency between the subject and cue, and the auxiliary task learning is helpful to extract the negation triplets with more semantic consistency.
- [1422] arXiv:2404.09833 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single VideoComments: CVPR 2024. Project page (with code): this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Creating high-quality and interactive virtual environments, such as games and simulators, often involves complex and costly manual modeling processes. In this paper, we present Video2Game, a novel approach that automatically converts videos of real-world scenes into realistic and interactive game environments. At the heart of our system are three core components:(i) a neural radiance fields (NeRF) module that effectively captures the geometry and visual appearance of the scene; (ii) a mesh module that distills the knowledge from NeRF for faster rendering; and (iii) a physics module that models the interactions and physical dynamics among the objects. By following the carefully designed pipeline, one can construct an interactable and actionable digital replica of the real world. We benchmark our system on both indoor and large-scale outdoor scenes. We show that we can not only produce highly-realistic renderings in real-time, but also build interactive games on top.
- [1423] arXiv:2404.09857 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Embodied visual tracking is to follow a target object in dynamic 3D environments using an agent's egocentric vision. This is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFM) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as ``Tracking Anything", to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online agent-environment interactions. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust tracker within an hour on a consumer-level GPU, e.g., Nvidia RTX 3090. Such efficiency is unprecedented for RL-based visual tracking methods. We evaluate our tracker on several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in terms of sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned tracker from the virtual world to real-world scenarios.
- [1424] arXiv:2404.09889 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Is Table Retrieval a Solved Problem? Join-Aware Multi-Table RetrievalSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Retrieving relevant tables containing the necessary information to accurately answer a given question over tables is critical to open-domain question-answering (QA) systems. Previous methods assume the answer to such a question can be found either in a single table or multiple tables identified through question decomposition or rewriting. However, neither of these approaches is sufficient, as many questions require retrieving multiple tables and joining them through a join plan that cannot be discerned from the user query itself. If the join plan is not considered in the retrieval stage, the subsequent steps of reasoning and answering based on those retrieved tables are likely to be incorrect. To address this problem, we introduce a method that uncovers useful join relations for any query and database during table retrieval. We use a novel re-ranking method formulated as a mixed-integer program that considers not only table-query relevance but also table-table relevance that requires inferring join relationships. Our method outperforms the state-of-the-art approaches for table retrieval by up to 9.3% in F1 score and for end-to-end QA by up to 5.4% in accuracy.
- [1425] arXiv:2404.09921 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Zero-shot Building Age Classification from Facade Image Using GPT-4Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: A building's age of construction is crucial for supporting many geospatial applications. Much current research focuses on estimating building age from facade images using deep learning. However, building an accurate deep learning model requires a considerable amount of labelled training data, and the trained models often have geographical constraints. Recently, large pre-trained vision language models (VLMs) such as GPT-4 Vision, which demonstrate significant generalisation capabilities, have emerged as potential training-free tools for dealing with specific vision tasks, but their applicability and reliability for building information remain unexplored. In this study, a zero-shot building age classifier for facade images is developed using prompts that include logical instructions. Taking London as a test case, we introduce a new dataset, FI-London, comprising facade images and building age epochs. Although the training-free classifier achieved a modest accuracy of 39.69%, the mean absolute error of 0.85 decades indicates that the model can predict building age epochs successfully albeit with a small bias. The ensuing discussion reveals that the classifier struggles to predict the age of very old buildings and is challenged by fine-grained predictions within 2 decades. Overall, the classifier utilising GPT-4 Vision is capable of predicting the rough age epoch of a building from a single facade image without any training.
- [1426] arXiv:2404.09931 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Zero-shot detection of buildings in mobile LiDAR using Language Vision ModelComments: 7 pages, 6 figures, conferenceSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recent advances have demonstrated that Language Vision Models (LVMs) surpass the existing State-of-the-Art (SOTA) in two-dimensional (2D) computer vision tasks, motivating attempts to apply LVMs to three-dimensional (3D) data. While LVMs are efficient and effective in addressing various downstream 2D vision tasks without training, they face significant challenges when it comes to point clouds, a representative format for representing 3D data. It is more difficult to extract features from 3D data and there are challenges due to large data sizes and the cost of the collection and labelling, resulting in a notably limited availability of datasets. Moreover, constructing LVMs for point clouds is even more challenging due to the requirements for large amounts of data and training time. To address these issues, our research aims to 1) apply the Grounded SAM through Spherical Projection to transfer 3D to 2D, and 2) experiment with synthetic data to evaluate its effectiveness in bridging the gap between synthetic and real-world data domains. Our approach exhibited high performance with an accuracy of 0.96, an IoU of 0.85, precision of 0.92, recall of 0.91, and an F1 score of 0.92, confirming its potential. However, challenges such as occlusion problems and pixel-level overlaps of multi-label points during spherical image generation remain to be addressed in future studies.
- [1427] arXiv:2404.09932 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Foundational Challenges in Assuring Alignment and Safety of Large Language ModelsUsman Anwar , Abulhair Saparov , Javier Rando , Daniel Paleka , Miles Turpin , Peter Hase , Ekdeep Singh Lubana , Erik Jenner , Stephen Casper , Oliver Sourbut , Benjamin L. Edelman , Zhaowei Zhang , Mario Günther , Anton Korinek , Jose Hernandez-Orallo , Lewis Hammond , Eric Bigelow , Alexander Pan , Lauro Langosco , Tomasz Korbak , Heidi Zhang , Ruiqi Zhong , Seán Ó hÉigeartaigh , Gabriel Recchia , Giulio Corsi , Alan Chan , Markus Anderljung , Lilian Edwards , Yoshua Bengio , Danqi Chen , Samuel Albanie , Tegan Maharaj , Jakob Foerster , Florian Tramer , He He , Atoosa Kasirzadeh , Yejin Choi , David KruegerSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Abstract: This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose $200+$ concrete research questions.
- [1428] arXiv:2404.09937 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Compression Represents Intelligence LinearlyComments: Preprint. Data and code are available at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract: There is a belief that learning to compress well will lead to intelligence. Recently, language modeling has been shown to be equivalent to compression, which offers a compelling rationale for the success of large language models (LLMs): the development of more advanced language models is essentially enhancing compression which facilitates intelligence. Despite such appealing discussions, little empirical evidence is present for the interplay between compression and intelligence. In this work, we examine their relationship in the context of LLMs, treating LLMs as data compressors. Given the abstract concept of "intelligence", we adopt the average downstream benchmark scores as a surrogate, specifically targeting intelligence related to knowledge and commonsense, coding, and mathematical reasoning. Across 12 benchmarks, our study brings together 30 public LLMs that originate from diverse organizations. Remarkably, we find that LLMs' intelligence -- reflected by average benchmark scores -- almost linearly correlates with their ability to compress external text corpora. These results provide concrete evidence supporting the belief that superior compression indicates greater intelligence. Furthermore, our findings suggest that compression efficiency, as an unsupervised metric derived from raw text corpora, serves as a reliable evaluation measure that is linearly associated with the model capabilities. We open-source our compression datasets as well as our data collection pipelines to facilitate future researchers to assess compression properly.
- [1429] arXiv:2404.09941 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Evolving Interpretable Visual Classifiers with Large Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Multimodal pre-trained models, such as CLIP, are popular for zero-shot classification due to their open-vocabulary flexibility and high performance. However, vision-language models, which compute similarity scores between images and class labels, are largely black-box, with limited interpretability, risk for bias, and inability to discover new visual concepts not written down. Moreover, in practical settings, the vocabulary for class names and attributes of specialized concepts will not be known, preventing these methods from performing well on images uncommon in large-scale vision-language datasets. To address these limitations, we present a novel method that discovers interpretable yet discriminative sets of attributes for visual recognition. We introduce an evolutionary search algorithm that uses a large language model and its in-context learning abilities to iteratively mutate a concept bottleneck of attributes for classification. Our method produces state-of-the-art, interpretable fine-grained classifiers. We outperform the latest baselines by 18.4% on five fine-grained iNaturalist datasets and by 22.2% on two KikiBouba datasets, despite the baselines having access to privileged information about class names.
- [1430] arXiv:2404.09946 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Note on Loss Functions and Error Compounding in Model-based Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: This note clarifies some confusions (and perhaps throws out more) around model-based reinforcement learning and their theoretical understanding in the context of deep RL. Main topics of discussion are (1) how to reconcile model-based RL's bad empirical reputation on error compounding with its superior theoretical properties, and (2) the limitations of empirically popular losses. For the latter, concrete counterexamples for the "MuZero loss" are constructed to show that it not only fails in stochastic environments, but also suffers exponential sample complexity in deterministic environments when data provides sufficient coverage.
- [1431] arXiv:2404.09956 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference OptimizationComments: this https URLSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Abstract: Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many of the recent diffusion-based text-to-audio models focus on training increasingly sophisticated diffusion models on a large set of datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or events and their temporal ordering in the output audio with respect to the input prompt. Our hypothesis is focusing on how these aspects of audio generation could improve audio generation performance in the presence of limited data. As such, in this work, using an existing text-to-audio model Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the publicly available Tango text-to-audio model using diffusion-DPO (direct preference optimization) loss on our preference dataset and show that it leads to improved audio output over Tango and AudioLDM2, in terms of both automatic- and manual-evaluation metrics.
- [1432] arXiv:2404.09967 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion ModelComments: First two authors contributed equally; Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: ControlNets are widely used for adding spatial control in image generation with different conditions, such as depth maps, canny edges, and human poses. However, there are several challenges when leveraging the pretrained image ControlNets for controlled video generation. First, pretrained ControlNet cannot be directly plugged into new backbone models due to the mismatch of feature spaces, and the cost of training ControlNets for new backbones is a big burden. Second, ControlNet features for different frames might not effectively handle the temporal consistency. To address these challenges, we introduce Ctrl-Adapter, an efficient and versatile framework that adds diverse controls to any image/video diffusion models, by adapting pretrained ControlNets (and improving temporal alignment for videos). Ctrl-Adapter provides diverse capabilities including image control, video control, video control with sparse frames, multi-condition control, compatibility with different backbones, adaptation to unseen control conditions, and video editing. In Ctrl-Adapter, we train adapter layers that fuse pretrained ControlNet features to different image/video diffusion models, while keeping the parameters of the ControlNets and the diffusion models frozen. Ctrl-Adapter consists of temporal and spatial modules so that it can effectively handle the temporal consistency of videos. We also propose latent skipping and inverse timestep sampling for robust adaptation and sparse control. Moreover, Ctrl-Adapter enables control from multiple conditions by simply taking the (weighted) average of ControlNet outputs. With diverse image/video diffusion backbones (SDXL, Hotshot-XL, I2VGen-XL, and SVD), Ctrl-Adapter matches ControlNet for image control and outperforms all baselines for video control (achieving the SOTA accuracy on the DAVIS 2017 dataset) with significantly lower computational costs (less than 10 GPU hours).
- [1433] arXiv:2404.09990 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: HQ-Edit: A High-Quality Dataset for Instruction-based Image EditingMude Hui , Siwei Yang , Bingchen Zhao , Yichun Shi , Heng Wang , Peng Wang , Yuyin Zhou , Cihang XieComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This study introduces HQ-Edit, a high-quality instruction-based image editing dataset with around 200,000 edits. Unlike prior approaches relying on attribute guidance or human feedback on building datasets, we devise a scalable data collection pipeline leveraging advanced foundation models, namely GPT-4V and DALL-E 3. To ensure its high quality, diverse examples are first collected online, expanded, and then used to create high-quality diptychs featuring input and output images with detailed text prompts, followed by precise alignment ensured through post-processing. In addition, we propose two evaluation metrics, Alignment and Coherence, to quantitatively assess the quality of image edit pairs using GPT-4V. HQ-Edits high-resolution images, rich in detail and accompanied by comprehensive editing prompts, substantially enhance the capabilities of existing image editing models. For example, an HQ-Edit finetuned InstructPix2Pix can attain state-of-the-art image editing performance, even surpassing those models fine-tuned with human-annotated data. The project page is this https URL .
- [1434] arXiv:2404.09992 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MMInA: Benchmarking Multihop Multimodal Internet AgentsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment for their embodiment across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark to evaluate the embodied agents for compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from or actions on multiple websites to solve, to assess long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We identify that agents are more likely to fail on the early hops when solving tasks of more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach replaying past action trajectories to reflect. Our method significantly improved both the single-hop and multihop web browsing abilities of agents. See our code and data at this https URL
- [1435] arXiv:2404.09995 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Taming Latent Diffusion Model for Neural Radiance Field InpaintingChieh Hubert Lin , Changil Kim , Jia-Bin Huang , Qinbo Li , Chih-Yao Ma , Johannes Kopf , Ming-Hsuan Yang , Hung-Yu TsengComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Neural Radiance Field (NeRF) is a representation for 3D reconstruction from multi-view images. Despite some recent work showing preliminary success in editing a reconstructed NeRF with diffusion prior, they remain struggling to synthesize reasonable geometry in completely uncovered regions. One major reason is the high diversity of synthetic contents from the diffusion model, which hinders the radiance field from converging to a crisp and deterministic geometry. Moreover, applying latent diffusion models on real data often yields a textural shift incoherent to the image condition due to auto-encoding errors. These two problems are further reinforced with the use of pixel-distance losses. To address these issues, we propose tempering the diffusion model's stochasticity with per-scene customization and mitigating the textural shift with masked adversarial training. During the analyses, we also found the commonly used pixel and perceptual losses are harmful in the NeRF inpainting task. Through rigorous experiments, our framework yields state-of-the-art NeRF inpainting results on various real-world scenes. Project page: this https URL
- [1436] arXiv:2404.10014 (cross-list from cs.MA) [ pdf , ps , other ]
-
Title: A biologically inspired computational trust model for open multi-agent systems which is resilient to trustor population changesSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI)
Abstract: Current trust and reputation models continue to have significant limitations, such as the inability to deal with agents constantly entering or exiting open multi-agent systems (open MAS), as well as continuously changing behaviors. Our study is based on CA, a previously proposed decentralized computational trust model from the trustee's point of view, inspired by synaptic plasticity and the formation of assemblies in the human brain. It is designed to meet the requirements of highly dynamic and open MAS, and its main difference with most conventional trust and reputation models is that the trustor does not select a trustee to delegate a task; instead, the trustee determines whether it is qualified to successfully execute it. We ran a series of simulations to compare CA model to FIRE, a well-established, decentralized trust and reputation model for open MAS under conditions of continuous trustee and trustor population replacement, as well as continuous change of trustees' abilities to perform tasks. The main finding is that FIRE is superior to changes in the trustee population, whereas CA is resilient to the trustor population changes. When the trustees switch performance profiles FIRE clearly outperforms despite the fact that both models' performances are significantly impacted by this environmental change. Findings lead us to conclude that learning to use the appropriate trust model, according to the dynamic conditions in effect could maximize the trustor's benefits.
- [1437] arXiv:2404.10017 (cross-list from quant-ph) [ pdf , ps , html , other ]
-
Title: Model-based Offline Quantum Reinforcement LearningSubjects: Quantum Physics (quant-ph) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper presents the first algorithm for model-based offline quantum reinforcement learning and demonstrates its functionality on the cart-pole benchmark. The model and the policy to be optimized are each implemented as variational quantum circuits. The model is trained by gradient descent to fit a pre-recorded data set. The policy is optimized with a gradient-free optimization scheme using the return estimate given by the model as the fitness function. This model-based approach allows, in principle, full realization on a quantum computer during the optimization phase and gives hope that a quantum advantage can be achieved as soon as sufficiently powerful quantum computers are available.
- [1438] arXiv:2404.10019 (cross-list from astro-ph.IM) [ pdf , ps , html , other ]
-
Title: Can AI Understand Our Universe? Test of Fine-Tuning GPT by Astrophysical DataYu Wang , Shu-Rui Zhang , Aidin Momtaz , Rahim Moradi , Fatemeh Rastegarnia , Narek Sahakyan , Soroush Shakeri , Liang LiComments: 27 pages, 7 figures. Comments welcomeSubjects: Instrumentation and Methods for Astrophysics (astro-ph.IM) ; Astrophysics of Galaxies (astro-ph.GA); High Energy Astrophysical Phenomena (astro-ph.HE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Abstract: ChatGPT has been the most talked-about concept in recent months, captivating both professionals and the general public alike, and has sparked discussions about the changes that artificial intelligence (AI) will bring to the world. As physicists and astrophysicists, we are curious about if scientific data can be correctly analyzed by large language models (LLMs) and yield accurate physics. In this article, we fine-tune the generative pre-trained transformer (GPT) model by the astronomical data from the observations of galaxies, quasars, stars, gamma-ray bursts (GRBs), and the simulations of black holes (BHs), the fine-tuned model demonstrates its capability to classify astrophysical phenomena, distinguish between two types of GRBs, deduce the redshift of quasars, and estimate BH parameters. We regard this as a successful test, marking the LLM's proven efficacy in scientific research. With the ever-growing volume of multidisciplinary data and the advancement of AI technology, we look forward to the emergence of a more fundamental and comprehensive understanding of our universe. This article also shares some interesting thoughts on data collection and AI design. Using the approach of understanding the universe - looking outward at data and inward for fundamental building blocks - as a guideline, we propose a method of series expansion for AI, suggesting ways to train and control AI that is smarter than humans.
- [1439] arXiv:2404.10031 (cross-list from q-bio.NC) [ pdf , ps , other ]
-
Title: Emergent Language Symbolic Autoencoder (ELSA) with Weak Supervision to Model Hierarchical Brain NetworksComments: 10 pages, 4 figuresSubjects: Neurons and Cognition (q-bio.NC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Brain networks display a hierarchical organization, a complexity that poses a challenge for existing deep learning models, often structured as flat classifiers, leading to difficulties in interpretability and the 'black box' issue. To bridge this gap, we propose a novel architecture: a symbolic autoencoder informed by weak supervision and an Emergent Language (EL) framework. This model moves beyond traditional flat classifiers by producing hierarchical clusters and corresponding imagery, subsequently represented through symbolic sentences to improve the clinical interpretability of hierarchically organized data such as intrinsic brain networks, which can be characterized using resting-state fMRI images. Our innovation includes a generalized hierarchical loss function designed to ensure that both sentences and images accurately reflect the hierarchical structure of functional brain networks. This enables us to model functional brain networks from a broader perspective down to more granular details. Furthermore, we introduce a quantitative method to assess the hierarchical consistency of these symbolic representations. Our qualitative analyses show that our model successfully generates hierarchically organized, clinically interpretable images, a finding supported by our quantitative evaluations. We find that our best performing loss function leads to a hierarchical consistency of over 97% when identifying images corresponding to brain networks. This approach not only advances the interpretability of deep learning models in neuroimaging analysis but also represents a significant step towards modeling the intricate hierarchical nature of brain networks.
- [1440] arXiv:2404.10054 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AIGeN: An Adversarial Approach for Instruction Generation in VLNComments: Accepted to 7th Multimodal Learning and Applications Workshop (MULA 2024) at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Abstract: In the last few years, the research interest in Vision-and-Language Navigation (VLN) has grown significantly. VLN is a challenging task that involves an agent following human instructions and navigating in a previously unknown environment to reach a specified goal. Recent work in literature focuses on different ways to augment the available datasets of instructions for improving navigation performance by exploiting synthetic training data. In this work, we propose AIGeN, a novel architecture inspired by Generative Adversarial Networks (GANs) that produces meaningful and well-formed synthetic instructions to improve navigation agents' performance. The model is composed of a Transformer decoder (GPT-2) and a Transformer encoder (BERT). During the training phase, the decoder generates sentences for a sequence of images describing the agent's path to a particular point while the encoder discriminates between real and fake instructions. Experimentally, we evaluate the quality of the generated instructions and perform extensive ablation studies. Additionally, we generate synthetic instructions for 217K trajectories using AIGeN on Habitat-Matterport 3D Dataset (HM3D) and show an improvement in the performance of an off-the-shelf VLN method. The validation analysis of our proposal is conducted on REVERIE and R2R and highlights the promising aspects of our proposal, achieving state-of-the-art performance.
- [1441] arXiv:2404.10096 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Vision Augmentation Prediction Autoencoder with Attention Design (VAPAAD)Comments: 12 pages, 4 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in sequence prediction have significantly improved the accuracy of video data interpretation; however, existing models often overlook the potential of attention-based mechanisms for next-frame prediction. This study introduces the Vision Augmentation Prediction Autoencoder with Attention Design (VAPAAD), an innovative approach that integrates attention mechanisms into sequence prediction, enabling nuanced analysis and understanding of temporal dynamics in video sequences. Utilizing the Moving MNIST dataset, we demonstrate VAPAAD's robust performance and superior handling of complex temporal data compared to traditional methods. VAPAAD combines data augmentation, ConvLSTM2D layers, and a custom-built self-attention mechanism to effectively focus on salient features within a sequence, enhancing predictive accuracy and context-aware analysis. This methodology not only adheres to human cognitive processes during video interpretation but also addresses limitations in conventional models, which often struggle with the variability inherent in video sequences. The experimental results confirm that VAPAAD outperforms existing models, especially in integrating attention mechanisms, which significantly improve predictive performance.
- [1442] arXiv:2404.10135 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Using Long Short-term Memory (LSTM) to merge precipitation data over mountainous area in Sierra NevadaSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Abstract: Obtaining reliable precipitation estimation with high resolutions in time and space is of great importance to hydrological studies. However, accurately estimating precipitation is a challenging task over high mountainous complex terrain. The three widely used precipitation measurement approaches, namely rainfall gauge, precipitation radars, and satellite-based precipitation sensors, have their own pros and cons in producing reliable precipitation products over complex areas. One way to decrease the detection error probability and improve data reliability is precipitation data merging. With the rapid advancements in computational capabilities and the escalating volume and diversity of earth observational data, Deep Learning (DL) models have gained considerable attention in geoscience. In this study, a deep learning technique, namely Long Short-term Memory (LSTM), was employed to merge a radar-based and a satellite-based Global Precipitation Measurement (GPM) precipitation product Integrated Multi-Satellite Retrievals for GPM (IMERG) precipitation product at hourly scale. The merged results are compared with the widely used reanalysis precipitation product, Multi-Radar Multi-Sensor (MRMS), and assessed against gauge observational data from the California Data Exchange Center (CDEC). The findings indicated that the LSTM-based merged precipitation notably underestimated gauge observations and, at times, failed to provide meaningful estimates, showing predominantly near-zero values. Relying solely on individual Quantitative Precipitation Estimates (QPEs) without additional meteorological input proved insufficient for generating reliable merged QPE. However, the merged results effectively captured the temporal trends of the observations, outperforming MRMS in this aspect. This suggested that incorporating bias correction techniques could potentially enhance the accuracy of the merged product.
- [1443] arXiv:2404.10136 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Language Model Cascades: Token-level uncertainty and beyondNeha Gupta , Harikrishna Narasimhan , Wittawat Jitkrittum , Ankit Singh Rawat , Aditya Krishna Menon , Sanjiv KumarSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks, but at the expense of increased inference costs. Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs: here, a small model is invoked for most "easy" instances, while a few "hard" instances are deferred to the large model. While the principles underpinning cascading are well-studied for classification tasks - with deferral based on predicted class uncertainty favored theoretically and practically - a similar understanding is lacking for generative LM tasks. In this work, we initiate a systematic study of deferral rules for LM cascades. We begin by examining the natural extension of predicted class uncertainty to generative LM tasks, namely, the predicted sequence uncertainty. We show that this measure suffers from the length bias problem, either over- or under-emphasizing outputs based on their lengths. This is because LMs produce a sequence of uncertainty values, one for each output token; and moreover, the number of output tokens is variable across examples. To mitigate this issue, we propose to exploit the richer token-level uncertainty information implicit in generative LMs. We argue that naive predicted sequence uncertainty corresponds to a simple aggregation of these uncertainties. By contrast, we show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform such simple aggregation strategies, via experiments on a range of natural language benchmarks with FLAN-T5 models. We further show that incorporating embeddings from the smaller model and intermediate layers of the larger model can give an additional boost in the overall cost-quality tradeoff.
- [1444] arXiv:2404.10142 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Shaping Realities: Enhancing 3D Generative AI with Fabrication ConstraintsSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Generative AI tools are becoming more prevalent in 3D modeling, enabling users to manipulate or create new models with text or images as inputs. This makes it easier for users to rapidly customize and iterate on their 3D designs and explore new creative ideas. These methods focus on the aesthetic quality of the 3D models, refining them to look similar to the prompts provided by the user. However, when creating 3D models intended for fabrication, designers need to trade-off the aesthetic qualities of a 3D model with their intended physical properties. To be functional post-fabrication, 3D models have to satisfy structural constraints informed by physical principles. Currently, such requirements are not enforced by generative AI tools. This leads to the development of aesthetically appealing, but potentially non-functional 3D geometry, that would be hard to fabricate and use in the real world. This workshop paper highlights the limitations of generative AI tools in translating digital creations into the physical world and proposes new augmentations to generative AI tools for creating physically viable 3D models. We advocate for the development of tools that manipulate or generate 3D models by considering not only the aesthetic appearance but also using physical properties as constraints. This exploration seeks to bridge the gap between digital creativity and real-world applicability, extending the creative potential of generative AI into the tangible domain.
- [1445] arXiv:2404.10162 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Optimal Kernel Tuning Parameter Prediction using Deep Sequence ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: GPU kernels have come to the forefront of computing due to their utility in varied fields, from high-performance computing to machine learning. A typical GPU compute kernel is invoked millions, if not billions of times in a typical application, which makes their performance highly critical. Due to the unknown nature of the optimization surface, an exhaustive search is required to discover the global optimum, which is infeasible due to the possible exponential number of parameter combinations. In this work, we propose a methodology that uses deep sequence-to-sequence models to predict the optimal tuning parameters governing compute kernels. This work considers the prediction of kernel parameters as a sequence to the sequence translation problem, borrowing models from the Natural Language Processing (NLP) domain. Parameters describing the input, output and weight tensors are considered as the input language to the model that emits the corresponding kernel parameters. In essence, the model translates the problem parameter language to kernel parameter language. The core contributions of this work are: a) Proposing that a sequence to sequence model can accurately learn the performance dynamics of a GPU compute kernel b) A novel network architecture which predicts the kernel tuning parameters for GPU kernels, c) A constrained beam search which incorporates the physical limits of the GPU hardware as well as other expert knowledge reducing the search space. The proposed algorithm can achieve more than 90% accuracy on various convolutional kernels in MIOpen, the AMD machine learning primitives library. As a result, the proposed technique can reduce the development time and compute resources required to tune unseen input configurations, resulting in shorter development cycles, reduced development costs, and better user experience.
- [1446] arXiv:2404.10163 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: EyeFormer: Predicting Personalized Scanpaths with Transformer-Guided Reinforcement LearningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: From a visual perception perspective, modern graphical user interfaces (GUIs) comprise a complex graphics-rich two-dimensional visuospatial arrangement of text, images, and interactive objects such as buttons and menus. While existing models can accurately predict regions and objects that are likely to attract attention ``on average'', so far there is no scanpath model capable of predicting scanpaths for an individual. To close this gap, we introduce EyeFormer, which leverages a Transformer architecture as a policy network to guide a deep reinforcement learning algorithm that controls gaze locations. Our model has the unique capability of producing personalized predictions when given a few user scanpath samples. It can predict full scanpath information, including fixation positions and duration, across individuals and various stimulus types. Additionally, we demonstrate applications in GUI layout optimization driven by our model. Our software and models will be publicly available.
- [1447] arXiv:2404.10170 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: High-Resolution Detection of Earth Structural Heterogeneities from Seismic Amplitudes using Convolutional Neural Networks with Attention layersSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Earth structural heterogeneities have a remarkable role in the petroleum economy for both exploration and production projects. Automatic detection of detailed structural heterogeneities is challenging when considering modern machine learning techniques like deep neural networks. Typically, these techniques can be an excellent tool for assisted interpretation of such heterogeneities, but it heavily depends on the amount of data to be trained.
We propose an efficient and cost-effective architecture for detecting seismic structural heterogeneities using Convolutional Neural Networks (CNNs) combined with Attention layers. The attention mechanism reduces costs and enhances accuracy, even in cases with relatively noisy data. Our model has half the parameters compared to the state-of-the-art, and it outperforms previous methods in terms of Intersection over Union (IoU) by 0.6% and precision by 0.4%. By leveraging synthetic data, we apply transfer learning to train and fine-tune the model, addressing the challenge of limited annotated data availability. - [1448] arXiv:2404.10177 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Consistent Diffusion Meets Tweedie: Training Exact Ambient Diffusion Models with Noisy DataComments: Preprint, work in progress. 19 pages, 9 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Ambient diffusion is a recently proposed framework for training diffusion models using corrupted data. Both Ambient Diffusion and alternative SURE-based approaches for learning diffusion models from corrupted data resort to approximations which deteriorate performance. We present the first framework for training diffusion models that provably sample from the uncorrupted distribution given only noisy training data, solving an open problem in this space. Our key technical contribution is a method that uses a double application of Tweedie's formula and a consistency loss function that allows us to extend sampling at noise levels below the observed data noise. We also provide further evidence that diffusion models memorize from their training sets by identifying extremely corrupted images that are almost perfectly reconstructed, raising copyright and privacy concerns. Our method for training using corrupted samples can be used to mitigate this problem. We demonstrate this by fine-tuning Stable Diffusion XL to generate samples from a distribution using only noisy samples. Our framework reduces the amount of memorization of the fine-tuning dataset, while maintaining competitive performance.
- [1449] arXiv:2404.10179 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Scaling Instructable Agents Across Many Simulated WorldsSIMA Team , Maria Abi Raad , Arun Ahuja , Catarina Barros , Frederic Besse , Andrew Bolt , Adrian Bolton , Bethanie Brownfield , Gavin Buttimore , Max Cant , Sarah Chakera , Stephanie C. Y. Chan , Jeff Clune , Adrian Collister , Vikki Copeman , Alex Cullum , Ishita Dasgupta , Dario de Cesare , Julia Di Trapani , Yani Donchev , Emma Dunleavy , Martin Engelcke , Ryan Faulkner , Frankie Garcia , Charles Gbadamosi , Zhitao Gong , Lucy Gonzales , Kshitij Gupta , Karol Gregor , Arne Olav Hallingstad , Tim Harley , Sam Haves , Felix Hill , Ed Hirst , Drew A. Hudson , Jony Hudson , Steph Hughes-Fitt , Danilo J. Rezende , Mimi Jasarevic , Laura Kampis , Rosemary Ke , Thomas Keck , Junkyung Kim , Oscar Knagg , Kavya Kopparapu , Andrew Lampinen , Shane Legg , Alexander Lerchner , Marjorie Limont , Yulan Liu , Maria Loks-Thompson , Joseph Marino , Kathryn Martin Cussons , Loic Matthey , Siobhan Mcloughlin , Piermaria Mendolicchio , Hamza Merzic , Anna Mitenkova , Alexandre Moufarek , Valeria Oliveira , Yanko Oliveira , Hannah Openshaw , Renke Pan , Aneesh Pappu , Alex Platonov , Ollie Purkiss , David Reichert , John Reid , Pierre Harvey Richemond , Tyson Roberts , Giles Ruscoe , Jaume Sanchez Elias , Tasha Sandars , Daniel P. Sawyer , Tim Scholtes , Guy Simmons , Daniel Slater , Hubert Soyer , Heiko Strathmann , Peter Stys , Allison C. Tam , Denis Teplyashin , Tayfun Terzi , Davide Vercelli , Bojan Vujatovic , Marcus Wainwright , Jane X. Wang , Zhengdong Wang , Daan Wierstra , Duncan Williams , Nathaniel Wong , Sarah York , Nick YoungSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: Building embodied AI systems that can follow arbitrary language instructions in any 3D environment is a key challenge for creating general AI. Accomplishing this goal requires learning to ground language in perception and embodied actions, in order to accomplish complex tasks. The Scalable, Instructable, Multiworld Agent (SIMA) project tackles this by training agents to follow free-form instructions across a diverse range of virtual 3D environments, including curated research environments as well as open-ended, commercial video games. Our goal is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment. Our approach focuses on language-driven generality while imposing minimal assumptions. Our agents interact with environments in real-time using a generic, human-like interface: the inputs are image observations and language instructions and the outputs are keyboard-and-mouse actions. This general approach is challenging, but it allows agents to ground language across many visually complex and semantically rich environments while also allowing us to readily run agents in new environments. In this paper we describe our motivation and goal, the initial progress we have made, and promising preliminary results on several diverse research environments and a variety of commercial video games.
- [1450] arXiv:2404.10180 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Deferred NAM: Low-latency Top-K Context Injection via Deferred Context Encoding for Non-Streaming ASRZelin Wu , Gan Song , Christopher Li , Pat Rondon , Zhong Meng , Xavier Velez , Weiran Wang , Diamantino Caseiro , Golan Pundak , Tsendsuren Munkhdalai , Angad Chandorkar , Rohit PrabhavalkarComments: 9 pages, 3 figures, accepted by NAACL 2024 - Industry TrackJournal-ref: 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics - Industry TrackSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
Abstract: Contextual biasing enables speech recognizers to transcribe important phrases in the speaker's context, such as contact names, even if they are rare in, or absent from, the training data. Attention-based biasing is a leading approach which allows for full end-to-end cotraining of the recognizer and biasing system and requires no separate inference-time components. Such biasers typically consist of a context encoder; followed by a context filter which narrows down the context to apply, improving per-step inference time; and, finally, context application via cross attention. Though much work has gone into optimizing per-frame performance, the context encoder is at least as important: recognition cannot begin before context encoding ends. Here, we show the lightweight phrase selection pass can be moved before context encoding, resulting in a speedup of up to 16.1 times and enabling biasing to scale to 20K phrases with a maximum pre-decoding delay under 33ms. With the addition of phrase- and wordpiece-level cross-entropy losses, our technique also achieves up to a 37.5% relative WER reduction over the baseline without the losses and lightweight phrase selection pass.
- [1451] arXiv:2404.10198 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal priorSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Retrieval augmented generation (RAG) is often used to fix hallucinations and provide up-to-date knowledge for large language models (LLMs). However, in cases when the LLM alone incorrectly answers a question, does providing the correct retrieved content always fix the error? Conversely, in cases where the retrieved content is incorrect, does the LLM know to ignore the wrong information, or does it recapitulate the error? To answer these questions, we systematically analyze the tug-of-war between a LLM's internal knowledge (i.e. its prior) and the retrieved information in settings when they disagree. We test GPT-4 and other LLMs on question-answering abilities across datasets with and without reference documents. As expected, providing the correct retrieved information fixes most model mistakes (94% accuracy). However, when the reference document is perturbed with increasing levels of wrong values, the LLM is more likely to recite the incorrect, modified information when its internal prior is weaker but is more resistant when its prior is stronger. Similarly, we also find that the more the modified information deviates from the model's prior, the less likely the model is to prefer it. These results highlight an underlying tension between a model's prior knowledge and the information presented in reference documents.
- [1452] arXiv:2404.10199 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language PromptingHuihan Li , Liwei Jiang , Jena D. Huang , Hyunwoo Kim , Sebastin Santy , Taylor Sorensen , Bill Yuchen Lin , Nouha Dziri , Xiang Ren , Yejin ChoiSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: As the utilization of large language models (LLMs) has proliferated worldwide, it is crucial for them to have adequate knowledge and fair representation for diverse global cultures. In this work, we uncover culture perceptions of three SOTA models on 110 countries and regions on 8 culture-related topics through culture-conditioned generations, and extract symbols from these generations that are associated to each culture by the LLM. We discover that culture-conditioned generation consist of linguistic "markers" that distinguish marginalized cultures apart from default cultures. We also discover that LLMs have an uneven degree of diversity in the culture symbols, and that cultures from different geographic regions have different presence in LLMs' culture-agnostic generation. Our findings promote further research in studying the knowledge and fairness of global culture perception in LLMs. Code and Data can be found in: this https URL
- [1453] arXiv:2404.10202 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards a Novel Perspective on Adversarial Examples Driven by FrequencySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Enhancing our understanding of adversarial examples is crucial for the secure application of machine learning models in real-world scenarios. A prevalent method for analyzing adversarial examples is through a frequency-based approach. However, existing research indicates that attacks designed to exploit low-frequency or high-frequency information can enhance attack performance, leading to an unclear relationship between adversarial perturbations and different frequency components. In this paper, we seek to demystify this relationship by exploring the characteristics of adversarial perturbations within the frequency domain. We employ wavelet packet decomposition for detailed frequency analysis of adversarial examples and conduct statistical examinations across various frequency bands. Intriguingly, our findings indicate that significant adversarial perturbations are present within the high-frequency components of low-frequency bands. Drawing on this insight, we propose a black-box adversarial attack algorithm based on combining different frequency bands. Experiments conducted on multiple datasets and models demonstrate that combining low-frequency bands and high-frequency components of low-frequency bands can significantly enhance attack efficiency. The average attack success rate reaches 99\%, surpassing attacks that utilize a single frequency segment. Additionally, we introduce the normalized disturbance visibility index as a solution to the limitations of $L_2$ norm in assessing continuous and discrete perturbations.
- [1454] arXiv:2404.10211 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Anomaly Correction of Business Processes Using Transformer AutoencoderSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Event log records all events that occur during the execution of business processes, so detecting and correcting anomalies in event log can provide reliable guarantee for subsequent process analysis. The previous works mainly include next event prediction based methods and autoencoder-based methods. These methods cannot accurately and efficiently detect anomalies and correct anomalies at the same time, and they all rely on the set threshold to detect anomalies. To solve these problems, we propose a business process anomaly correction method based on Transformer autoencoder. By using self-attention mechanism and autoencoder structure, it can efficiently process event sequences of arbitrary length, and can directly output corrected business process instances, so that it can adapt to various scenarios. At the same time, the anomaly detection is transformed into a classification problem by means of selfsupervised learning, so that there is no need to set a specific threshold in anomaly detection. The experimental results on several real-life event logs show that the proposed method is superior to the previous methods in terms of anomaly detection accuracy and anomaly correction results while ensuring high running efficiency.
- [1455] arXiv:2404.10218 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Autonomous Implicit Indoor Scene Reconstruction with Frontier ExplorationComments: 7 pagesJournal-ref: IEEE International Conference on Robotics and Automation (ICRA 2024)Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Implicit neural representations have demonstrated significant promise for 3D scene reconstruction. Recent works have extended their applications to autonomous implicit reconstruction through the Next Best View (NBV) based method. However, the NBV method cannot guarantee complete scene coverage and often necessitates extensive viewpoint sampling, particularly in complex scenes. In the paper, we propose to 1) incorporate frontier-based exploration tasks for global coverage with implicit surface uncertainty-based reconstruction tasks to achieve high-quality reconstruction. and 2) introduce a method to achieve implicit surface uncertainty using color uncertainty, which reduces the time needed for view selection. Further with these two tasks, we propose an adaptive strategy for switching modes in view path planning, to reduce time and maintain superior reconstruction quality. Our method exhibits the highest reconstruction quality among all planning methods and superior planning efficiency in methods involving reconstruction tasks. We deploy our method on a UAV and the results show that our method can plan multi-task views and reconstruct a scene with high quality.
- [1456] arXiv:2404.10220 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4VPeiyuan Zhi , Zhiyuan Zhang , Muzhi Han , Zeyu Zhang , Zhitian Li , Ziyuan Jiao , Baoxiong Jia , Siyuan HuangSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. We present COME-robot, the first closed-loop framework utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. We meticulously construct a library of action primitives for robot exploration, navigation, and manipulation, serving as callable execution modules for GPT-4V in task planning. On top of these modules, GPT-4V serves as the brain that can accomplish multimodal reasoning, generate action policy with code, verify the task progress, and provide feedback for replanning. Such design enables COME-robot to (i) actively perceive the environments, (ii) perform situated reasoning, and (iii) recover from failures. Through comprehensive experiments involving 8 challenging real-world tabletop and manipulation tasks, COME-robot demonstrates a significant improvement in task success rate (~25%) compared to state-of-the-art baseline methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.
- [1457] arXiv:2404.10225 (cross-list from cs.SE) [ pdf , ps , other ]
-
Title: Rethinking Software Engineering in the Foundation Model Era: From Task-Driven AI Copilots to Goal-Driven AI Pair ProgrammersSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: The advent of Foundation Models (FMs) and AI-powered copilots has transformed the landscape of software development, offering unprecedented code completion capabilities and enhancing developer productivity. However, the current task-driven nature of these copilots falls short in addressing the broader goals and complexities inherent in software engineering (SE). In this paper, we propose a paradigm shift towards goal-driven AI-powered pair programmers that collaborate with human developers in a more holistic and context-aware manner. We envision AI pair programmers that are goal-driven, human partners, SE-aware, and self-learning. These AI partners engage in iterative, conversation-driven development processes, aligning closely with human goals and facilitating informed decision-making. We discuss the desired attributes of such AI pair programmers and outline key challenges that must be addressed to realize this vision. Ultimately, our work represents a shift from AI-augmented SE to AI-transformed SE by replacing code completion with a collaborative partnership between humans and AI that enhances both productivity and software quality.
- [1458] arXiv:2404.10241 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Vision-and-Language Navigation via Causal LearningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In the pursuit of robust and generalizable environment perception and language understanding, the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents, hindering their performance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT), a pioneering solution rooted in the paradigm of causal inference. By delving into both observable and unobservable confounders within vision, language, and history, we propose the back-door and front-door adjustment causal learning (BACL and FACL) modules to promote unbiased learning by comprehensively mitigating potential spurious correlations. Additionally, to capture global confounder features, we propose a cross-modal feature pooling (CFP) module supervised by contrastive learning, which is also shown to be effective in improving cross-modal representations during pre-training. Extensive experiments across multiple VLN datasets (R2R, REVERIE, RxR, and SOON) underscore the superiority of our proposed method over previous state-of-the-art approaches. Code is available at this https URL .
- [1459] arXiv:2404.10242 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Masked Autoencoders for Microscopy are Scalable Learners of Cellular BiologyOren Kraus , Kian Kenyon-Dean , Saber Saberian , Maryam Fallah , Peter McLean , Jess Leung , Vasudev Sharma , Ayla Khan , Jia Balakrishnan , Safiye Celik , Dominique Beaini , Maciej Sypetkowski , Chi Vicky Cheng , Kristen Morse , Maureen Makes , Ben Mabey , Berton EarnshawComments: CVPR 2024 Highlight. arXiv admin note: text overlap with arXiv:2309.16064Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Featurizing microscopy images for use in biological research remains a significant challenge, especially for large-scale experiments spanning millions of images. This work explores the scaling properties of weakly supervised classifiers and self-supervised masked autoencoders (MAEs) when training with increasingly larger model backbones and microscopy datasets. Our results show that ViT-based MAEs outperform weakly supervised classifiers on a variety of tasks, achieving as much as a 11.5% relative improvement when recalling known biological relationships curated from public databases. Additionally, we develop a new channel-agnostic MAE architecture (CA-MAE) that allows for inputting images of different numbers and orders of channels at inference time. We demonstrate that CA-MAEs effectively generalize by inferring and evaluating on a microscopy image dataset (JUMP-CP) generated under different experimental conditions with a different channel structure than our pretraining data (RPI-93M). Our findings motivate continued research into scaling self-supervised learning on microscopy data in order to create powerful foundation models of cellular biology that have the potential to catalyze advancements in drug discovery and beyond.
- [1460] arXiv:2404.10259 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Uncovering Latent Arguments in Social Media Messaging by Employing LLMs-in-the-Loop StrategySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract: The widespread use of social media has led to a surge in popularity for automated methods of analyzing public opinion. Supervised methods are adept at text categorization, yet the dynamic nature of social media discussions poses a continual challenge for these techniques due to the constant shifting of the focus. On the other hand, traditional unsupervised methods for extracting themes from public discourse, such as topic modeling, often reveal overarching patterns that might not capture specific nuances. Consequently, a significant portion of research into social media discourse still depends on labor-intensive manual coding techniques and a human-in-the-loop approach, which are both time-consuming and costly. In this work, we study the problem of discovering arguments associated with a specific theme. We propose a generic LLMs-in-the-Loop strategy that leverages the advanced capabilities of Large Language Models (LLMs) to extract latent arguments from social media messaging. To demonstrate our approach, we apply our framework to contentious topics. We use two publicly available datasets: (1) the climate campaigns dataset of 14k Facebook ads with 25 themes and (2) the COVID-19 vaccine campaigns dataset of 9k Facebook ads with 14 themes. Furthermore, we analyze demographic targeting and the adaptation of messaging based on real-world events.
- [1461] arXiv:2404.10260 (cross-list from q-bio.BM) [ pdf , ps , html , other ]
-
Title: HelixFold-Multimer: Elevating Protein Complex Structure Prediction to New HeightsSubjects: Biomolecules (q-bio.BM) ; Artificial Intelligence (cs.AI)
Abstract: While monomer protein structure prediction tools boast impressive accuracy, the prediction of protein complex structures remains a daunting challenge in the field. This challenge is particularly pronounced in scenarios involving complexes with protein chains from different species, such as antigen-antibody interactions, where accuracy often falls short. Limited by the accuracy of complex prediction, tasks based on precise protein-protein interaction analysis also face obstacles. In this report, we highlight the ongoing advancements of our protein complex structure prediction model, HelixFold-Multimer, underscoring its enhanced performance. HelixFold-Multimer provides precise predictions for diverse protein complex structures, especially in therapeutic protein interactions. Notably, HelixFold-Multimer achieves remarkable success in antigen-antibody and peptide-protein structure prediction, surpassing AlphaFold-Multimer by several folds. HelixFold-Multimer is now available for public use on the PaddleHelix platform, offering both a general version and an antigen-antibody version. Researchers can conveniently access and utilize this service for their development needs.
- [1462] arXiv:2404.10267 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: OneActor: Consistent Character Generation via Cluster-Conditioned GuidanceSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Text-to-image diffusion models benefit artists with high-quality image generation. Yet its stochastic nature prevent artists from creating consistent images of the same character. Existing methods try to tackle this challenge and generate consistent content in various ways. However, they either depend on external data or require expensive tuning of the diffusion model. For this issue, we argue that a lightweight but intricate guidance is enough to function. Aiming at this, we lead the way to formalize the objective of consistent generation, derive a clustering-based score function and propose a novel paradigm, OneActor. We design a cluster-conditioned model which incorporates posterior samples to guide the denoising trajectories towards the target cluster. To overcome the overfitting challenge shared by one-shot tuning pipelines, we devise auxiliary components to simultaneously augment the tuning and regulate the inference. This technique is later verified to significantly enhance the content diversity of generated images. Comprehensive experiments show that our method outperforms a variety of baselines with satisfactory character consistency, superior prompt conformity as well as high image quality. And our method is at least 4 times faster than tuning-based baselines. Furthermore, to our best knowledge, we first prove that the semantic space has the same interpolation property as the latent space dose. This property can serve as another promising tool for fine generation control.
- [1463] arXiv:2404.10271 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Social Choice for AI Alignment: Dealing with Diverse Human FeedbackVincent Conitzer , Rachel Freedman , Jobst Heitzig , Wesley H. Holliday , Bob M. Jacobs , Nathan Lambert , Milan Mossé , Eric Pacuit , Stuart Russell , Hailey Schoelkopf , Emanuel Tewolde , William S. ZwickerComments: 15 pages, 4 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
Abstract: Foundation models such as GPT-4 are fine-tuned to avoid unsafe or otherwise problematic behavior, so that, for example, they refuse to comply with requests for help with committing crimes or with producing racist text. One approach to fine-tuning, called reinforcement learning from human feedback, learns from humans' expressed preferences over multiple outputs. Another approach is constitutional AI, in which the input from humans is a list of high-level principles. But how do we deal with potentially diverging input from humans? How can we aggregate the input into consistent data about ''collective'' preferences or otherwise use it to make collective choices about model behavior? In this paper, we argue that the field of social choice is well positioned to address these questions, and we discuss ways forward for this agenda, drawing on discussions in a recent workshop on Social Choice for AI Ethics and Safety held in Berkeley, CA, USA in December 2023.
- [1464] arXiv:2404.10275 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: OptiGrad: A Fair and more Efficient Price Elasticity Optimization via a Gradient Based LearningComments: 17 pages, 5 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Applications (stat.AP)
Abstract: This paper presents a novel approach to optimizing profit margins in non-life insurance markets through a gradient descent-based method, targeting three key objectives: 1) maximizing profit margins, 2) ensuring conversion rates, and 3) enforcing fairness criteria such as demographic parity (DP). Traditional pricing optimization, which heavily lean on linear and semi definite programming, encounter challenges in balancing profitability and fairness. These challenges become especially pronounced in situations that necessitate continuous rate adjustments and the incorporation of fairness criteria. Specifically, indirect Ratebook optimization, a widely-used method for new business price setting, relies on predictor models such as XGBoost or GLMs/GAMs to estimate on downstream individually optimized prices. However, this strategy is prone to sequential errors and struggles to effectively manage optimizations for continuous rate scenarios. In practice, to save time actuaries frequently opt for optimization within discrete intervals (e.g., range of [-20\%, +20\%] with fix increments) leading to approximate estimations. Moreover, to circumvent infeasible solutions they often use relaxed constraints leading to suboptimal pricing strategies. The reverse-engineered nature of traditional models complicates the enforcement of fairness and can lead to biased outcomes. Our method addresses these challenges by employing a direct optimization strategy in the continuous space of rates and by embedding fairness through an adversarial predictor model. This innovation not only reduces sequential errors and simplifies the complexities found in traditional models but also directly integrates fairness measures into the commercial premium calculation. We demonstrate improved margin performance and stronger enforcement of fairness highlighting the critical need to evolve existing pricing strategies.
- [1465] arXiv:2404.10281 (cross-list from cs.HC) [ pdf , ps , other ]
-
Title: AI-Assisted Writing in Education: Ecosystem Risks and MitigationsSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: While the excitement around the capabilities of technological advancements is giving rise to new AI-based writing assistants, the overarching ecosystem plays a crucial role in how they are adopted in educational practice. In this paper, we point to key ecological aspects for consideration. We draw insights from extensive research integrated with practice on a writing feedback tool over 9 years at a university, and we highlight potential risks when these are overlooked. It informs the design of educational writing support tools to be better aligned within broader contexts to balance innovation with practical impact.
- [1466] arXiv:2404.10289 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: The Dearth of the Author in AI-Supported WritingComments: Published as a workshop paper at the In2Writing workshop at CHI 2024Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: We diagnose and briefly discuss the dearth of the author: a condition that arises when AI-based creativity support tools for writing allow users to produce large amounts of text without making a commensurate number of creative decisions, resulting in output that is sparse in expressive intent. We argue that the dearth of the author helps to explain a number of recurring difficulties and anxieties around AI-based writing support tools, but that it also suggests an ambitious new goal for AI-based CSTs.
- [1467] arXiv:2404.10296 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Engineering software 2.0 by interpolating neural networks: unifying training, solving, and calibrationChanwook Park , Sourav Saha , Jiachen Guo , Xiaoyu Xie , Satyajit Mojumder , Miguel A. Bessa , Dong Qian , Wei Chen , Gregory J. Wagner , Jian Cao , Wing Kam LiuComments: 9 pages, 3 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: The evolution of artificial intelligence (AI) and neural network theories has revolutionized the way software is programmed, shifting from a hard-coded series of codes to a vast neural network. However, this transition in engineering software has faced challenges such as data scarcity, multi-modality of data, low model accuracy, and slow inference. Here, we propose a new network based on interpolation theories and tensor decomposition, the interpolating neural network (INN). Instead of interpolating training data, a common notion in computer science, INN interpolates interpolation points in the physical space whose coordinates and values are trainable. It can also extrapolate if the interpolation points reside outside of the range of training data and the interpolation functions have a larger support domain. INN features orders of magnitude fewer trainable parameters, faster training, a smaller memory footprint, and higher model accuracy compared to feed-forward neural networks (FFNN) or physics-informed neural networks (PINN). INN is poised to usher in Engineering Software 2.0, a unified neural network that spans various domains of space, time, parameters, and initial/boundary conditions. This has previously been computationally prohibitive due to the exponentially growing number of trainable parameters, easily exceeding the parameter size of ChatGPT, which is over 1 trillion. INN addresses this challenge by leveraging tensor decomposition and tensor product, with adaptable network architecture.
- [1468] arXiv:2404.10297 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Future Language Modeling from Temporal Document HistoryComments: Accepted by ICLR 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Predicting the future is of great interest across many aspects of human activity. Businesses are interested in future trends, traders are interested in future stock prices, and companies are highly interested in future technological breakthroughs. While there are many automated systems for predicting future numerical data, such as weather, stock prices, and demand for products, there is relatively little work in automatically predicting textual data. Humans are interested in textual data predictions because it is a natural format for our consumption, and experts routinely make predictions in a textual format (Christensen et al., 2004; Tetlock & Gardner, 2015; Frick, 2015). However, there has been relatively little formalization of this general problem in the machine learning or natural language processing communities. To address this gap, we introduce the task of future language modeling: probabilistic modeling of texts in the future based on a temporal history of texts. To our knowledge, our work is the first work to formalize the task of predicting the future in this way. We show that it is indeed possible to build future language models that improve upon strong non-temporal language model baselines, opening the door to working on this important, and widely applicable problem.
- [1469] arXiv:2404.10299 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Clustering and Data Augmentation to Improve Accuracy of Sleep Assessment and Sleep Individuality AnalysisSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: Recently, growing health awareness, novel methods allow individuals to monitor sleep at home. Utilizing sleep sounds offers advantages over conventional methods like smartwatches, being non-intrusive, and capable of detecting various physiological activities. This study aims to construct a machine learning-based sleep assessment model providing evidence-based assessments, such as poor sleep due to frequent movement during sleep onset. Extracting sleep sound events, deriving latent representations using VAE, clustering with GMM, and training LSTM for subjective sleep assessment achieved a high accuracy of 94.8% in distinguishing sleep satisfaction. Moreover, TimeSHAP revealed differences in impactful sound event types and timings for different individuals.
- [1470] arXiv:2404.10307 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Learnable Prompt for Few-Shot Semantic Segmentation in Remote Sensing DomainComments: Accepted to CVPRW 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Few-shot segmentation is a task to segment objects or regions of novel classes within an image given only a few annotated examples. In the generalized setting, the task extends to segment both the base and the novel classes. The main challenge is how to train the model such that the addition of novel classes does not hurt the base classes performance, also known as catastrophic forgetting. To mitigate this issue, we use SegGPT as our base model and train it on the base classes. Then, we use separate learnable prompts to handle predictions for each novel class. To handle various object sizes which typically present in remote sensing domain, we perform patch-based prediction. To address the discontinuities along patch boundaries, we propose a patch-and-stitch technique by re-framing the problem as an image inpainting task. During inference, we also utilize image similarity search over image embeddings for prompt selection and novel class filtering to reduce false positive predictions. Based on our experiments, our proposed method boosts the weighted mIoU of a simple fine-tuned SegGPT from 15.96 to 35.08 on the validation set of few-shot OpenEarthMap dataset given in the challenge.
- [1471] arXiv:2404.10308 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMsComments: Accepted to ICLR 2024. The first two authors contributed equallySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have shown remarkable performance in various natural language processing tasks. However, a primary constraint they face is the context limit, i.e., the maximum number of tokens they can process. Previous works have explored architectural changes and modifications in positional encoding to relax the constraint, but they often require expensive training or do not address the computational demands of self-attention. In this paper, we present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations. HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks. Each chunk is then processed collectively, employing a hierarchical strategy that merges adjacent chunks at progressive transformer layers. A token reduction technique precedes each merging, ensuring memory usage efficiency. We also propose an optimized computational order reducing the memory requirement to logarithmically scale with respect to input length, making it especially favorable for environments with tight memory restrictions. Our experiments demonstrate the proposed method's superior performance and memory efficiency, enabling the broader use of LLMs in contexts requiring extended context. Code is available at this https URL .
- [1472] arXiv:2404.10311 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Learning and Optimization for Price-based Demand Response of Electric Vehicle ChargingComments: Accepted by American Control Conference (ACC) 2024Subjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI)
Abstract: In the context of charging electric vehicles (EVs), the price-based demand response (PBDR) is becoming increasingly significant for charging load management. Such response usually encourages cost-sensitive customers to adjust their energy demand in response to changes in price for financial incentives. Thus, to model and optimize EV charging, it is important for charging station operator to model the PBDR patterns of EV customers by precisely predicting charging demands given price signals. Then the operator refers to these demands to optimize charging station power allocation policy. The standard pipeline involves offline fitting of a PBDR function based on historical EV charging records, followed by applying estimated EV demands in downstream charging station operation optimization. In this work, we propose a new decision-focused end-to-end framework for PBDR modeling that combines prediction errors and downstream optimization cost errors in the model learning stage. We evaluate the effectiveness of our method on a simulation of charging station operation with synthetic PBDR patterns of EV customers, and experimental results demonstrate that this framework can provide a more reliable prediction model for the ultimate optimization process, leading to more effective optimization solutions in terms of cost savings and charging station operation objectives with only a few training samples.
- [1473] arXiv:2404.10320 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: CARE to Compare: A real-world dataset for anomaly detection in wind turbine dataComments: 28 pages, 3 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Anomaly detection plays a crucial role in the field of predictive maintenance for wind turbines, yet the comparison of different algorithms poses a difficult task because domain specific public datasets are scarce. Many comparisons of different approaches either use benchmarks composed of data from many different domains, inaccessible data or one of the few publicly available datasets which lack detailed information about the faults. Moreover, many publications highlight a couple of case studies where fault detection was successful. With this paper we publish a high quality dataset that contains data from 36 wind turbines across 3 different wind farms as well as the most detailed fault information of any public wind turbine dataset as far as we know. The new dataset contains 89 years worth of real-world operating data of wind turbines, distributed across 44 labeled time frames for anomalies that led up to faults, as well as 51 time series representing normal behavior. Additionally, the quality of training data is ensured by turbine-status-based labels for each data point. Furthermore, we propose a new scoring method, called CARE (Coverage, Accuracy, Reliability and Earliness), which takes advantage of the information depth that is present in the dataset to identify a good all-around anomaly detection model. This score considers the anomaly detection performance, the ability to recognize normal behavior properly and the capability to raise as few false alarms as possible while simultaneously detecting anomalies early.
- [1474] arXiv:2404.10332 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Prescribing the Right Remedy: Mitigating Hallucinations in Large Vision-Language Models via Targeted Instruction TuningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Despite achieving outstanding performance on various cross-modal tasks, current large vision-language models (LVLMs) still suffer from hallucination issues, manifesting as inconsistencies between their generated responses and the corresponding images. Prior research has implicated that the low quality of instruction data, particularly the skewed balance between positive and negative samples, is a significant contributor to model hallucinations. Recently, researchers have proposed high-quality instruction datasets, such as LRV-Instruction, to mitigate model hallucination. Nonetheless, our investigation reveals that hallucinatory concepts from different LVLMs exhibit specificity, i.e. the distribution of hallucinatory concepts varies significantly across models. Existing datasets did not consider the hallucination specificity of different models in the design processes, thereby diminishing their efficacy in mitigating model hallucination. In this paper, we propose a targeted instruction data generation framework named DFTG that tailored to the hallucination specificity of different models. Concretely, DFTG consists of two stages: hallucination diagnosis, which extracts the necessary information from the model's responses and images for hallucination diagnosis; and targeted data generation, which generates targeted instruction data based on diagnostic results. The experimental results on hallucination benchmarks demonstrate that the targeted instruction data generated by our method are more effective in mitigating hallucinations compared to previous datasets.
- [1475] arXiv:2404.10356 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Generating Counterfactual Trajectories with Latent Diffusion Models for Concept DiscoveryComments: Submitted to International Conference on Pattern Recognition (ICPR) 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Trustworthiness is a major prerequisite for the safe application of opaque deep learning models in high-stakes domains like medicine. Understanding the decision-making process not only contributes to fostering trust but might also reveal previously unknown decision criteria of complex models that could advance the state of medical research. The discovery of decision-relevant concepts from black box models is a particularly challenging task. This study proposes Concept Discovery through Latent Diffusion-based Counterfactual Trajectories (CDCT), a novel three-step framework for concept discovery leveraging the superior image synthesis capabilities of diffusion models. In the first step, CDCT uses a Latent Diffusion Model (LDM) to generate a counterfactual trajectory dataset. This dataset is used to derive a disentangled representation of classification-relevant concepts using a Variational Autoencoder (VAE). Finally, a search algorithm is applied to identify relevant concepts in the disentangled latent space. The application of CDCT to a classifier trained on the largest public skin lesion dataset revealed not only the presence of several biases but also meaningful biomarkers. Moreover, the counterfactuals generated within CDCT show better FID scores than those produced by a previously established state-of-the-art method, while being 12 times more resource-efficient. Unsupervised concept discovery holds great potential for the application of trustworthy AI and the further development of human knowledge in various domains. CDCT represents a further step in this direction.
- [1476] arXiv:2404.10378 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Second Edition FRCSyn Challenge at CVPR 2024: Face Recognition Challenge in the Era of Synthetic DataIvan DeAndres-Tame , Ruben Tolosana , Pietro Melzi , Ruben Vera-Rodriguez , Minchul Kim , Christian Rathgeb , Xiaoming Liu , Aythami Morales , Julian Fierrez , Javier Ortega-Garcia , Zhizhou Zhong , Yuge Huang , Yuxi Mi , Shouhong Ding , Shuigeng Zhou , Shuai He , Lingzhi Fu , Heng Cong , Rongyu Zhang , Zhihong Xiao , Evgeny Smirnov , Anton Pimenov , Aleksei Grigorev , Denis Timoshenko , Kaleb Mesfin Asfaw , Cheng Yaw Low , Hao Liu , Chuyi Wang , Qing Zuo , Zhixiang He , Hatef Otroshi Shahreza , Anjith George , Alexander Unnervik , Parsa Rahimi , Sébastien Marcel , Pedro C. Neto , Marco Huber , Jan Niklas Kolf , Naser Damer , Fadi Boutros , Jaime S. Cardoso , Ana F. Sequeira , Andrea Atzori , Gianni Fenu , Mirko Marras , Vitomir Štruc , Jiang Yu , Zhangjie Li , Jichun Li , Weisong Zhao , Zhen Lei , Xiangyu Zhu , Xiao-Yu Zhang , Bernardo Biesseck , Pedro Vidal , Luiz Coelho , Roger Granada , David MenottiComments: arXiv admin note: text overlap with arXiv:2311.10476Journal-ref: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRw 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Synthetic data is gaining increasing relevance for training machine learning models. This is mainly motivated due to several factors such as the lack of real data and intra-class variability, time and errors produced in manual labeling, and in some cases privacy concerns, among others. This paper presents an overview of the 2nd edition of the Face Recognition Challenge in the Era of Synthetic Data (FRCSyn) organized at CVPR 2024. FRCSyn aims to investigate the use of synthetic data in face recognition to address current technological limitations, including data privacy concerns, demographic biases, generalization to novel scenarios, and performance constraints in challenging situations such as aging, pose variations, and occlusions. Unlike the 1st edition, in which synthetic data from DCFace and GANDiffFace methods was only allowed to train face recognition systems, in this 2nd edition we propose new sub-tasks that allow participants to explore novel face generative methods. The outcomes of the 2nd FRCSyn Challenge, along with the proposed experimental protocol and benchmarking contribute significantly to the application of synthetic data to face recognition.
- [1477] arXiv:2404.10384 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Reasoning on Efficient Knowledge Paths:Knowledge Graph Guides Large Language Model for Domain Question AnsweringSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Large language models (LLMs), such as GPT3.5, GPT4 and LLAMA2 perform surprisingly well and outperform human experts on many tasks. However, in many domain-specific evaluations, these LLMs often suffer from hallucination problems due to insufficient training of relevant corpus. Furthermore, fine-tuning large models may face problems such as the LLMs are not open source or the construction of high-quality domain instruction is difficult. Therefore, structured knowledge databases such as knowledge graph can better provide domain background knowledge for LLMs and make full use of the reasoning and analysis capabilities of LLMs. In some previous works, LLM was called multiple times to determine whether the current triplet was suitable for inclusion in the subgraph when retrieving subgraphs through a question. Especially for the question that require a multi-hop reasoning path, frequent calls to LLM will consume a lot of computing power. Moreover, when choosing the reasoning path, LLM will be called once for each step, and if one of the steps is selected incorrectly, it will lead to the accumulation of errors in the following steps. In this paper, we integrated and optimized a pipeline for selecting reasoning paths from KG based on LLM, which can reduce the dependency on LLM. In addition, we propose a simple and effective subgraph retrieval method based on chain of thought (CoT) and page rank which can returns the paths most likely to contain the answer. We conduct experiments on three datasets: GenMedGPT-5k [14], WebQuestions [2], and CMCQA [21]. Finally, RoK can demonstrate that using fewer LLM calls can achieve the same results as previous SOTAs models.
- [1478] arXiv:2404.10386 (cross-list from cs.DC) [ pdf , ps , html , other ]
-
Title: I/O in Machine Learning Applications on HPC Systems: A 360-degree SurveySubjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: High-Performance Computing (HPC) systems excel in managing distributed workloads, and the growing interest in Artificial Intelligence (AI) has resulted in a surge in demand for faster methods of Machine Learning (ML) model training and inference. In the past, research on HPC I/O focused on optimizing the underlying storage system for modeling and simulation applications and checkpointing the results, causing writes to be the dominant I/O operation. These applications typically access large portions of the data written by simulations or experiments. ML workloads, in contrast, perform small I/O reads spread across a large number of random files. This shift of I/O access patterns poses several challenges to HPC storage systems. In this paper, we survey I/O in ML applications on HPC systems, and target literature within a 6-year time window from 2019 to 2024. We provide an overview of the common phases of ML, review available profilers and benchmarks, examine the I/O patterns encountered during ML training, explore I/O optimizations utilized in modern ML frameworks and proposed in recent literature, and lastly, present gaps requiring further R&D. We seek to summarize the common practices used in accessing data by ML applications and expose research gaps that could spawn further R&D.
- [1479] arXiv:2404.10393 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Offline Trajectory Generalization for Offline Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Offline reinforcement learning (RL) aims to learn policies from static datasets of previously collected trajectories. Existing methods for offline RL either constrain the learned policy to the support of offline data or utilize model-based virtual environments to generate simulated rollouts. However, these methods suffer from (i) poor generalization to unseen states; and (ii) trivial improvement from low-qualified rollout simulation. In this paper, we propose offline trajectory generalization through world transformers for offline reinforcement learning (OTTO). Specifically, we use casual Transformers, a.k.a. World Transformers, to predict state dynamics and the immediate reward. Then we propose four strategies to use World Transformers to generate high-rewarded trajectory simulation by perturbing the offline data. Finally, we jointly use offline data with simulated data to train an offline RL algorithm. OTTO serves as a plug-in module and can be integrated with existing offline RL methods to enhance them with better generalization capability of transformers and high-rewarded data augmentation. Conducting extensive experiments on D4RL benchmark datasets, we verify that OTTO significantly outperforms state-of-the-art offline RL methods.
- [1480] arXiv:2404.10405 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Integration of Self-Supervised BYOL in Semi-Supervised Medical Image RecognitionComments: Accepted by ICCS 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Image recognition techniques heavily rely on abundant labeled data, particularly in medical contexts. Addressing the challenges associated with obtaining labeled data has led to the prominence of self-supervised learning and semi-supervised learning, especially in scenarios with limited annotated data. In this paper, we proposed an innovative approach by integrating self-supervised learning into semi-supervised models to enhance medical image recognition. Our methodology commences with pre-training on unlabeled data utilizing the BYOL method. Subsequently, we merge pseudo-labeled and labeled datasets to construct a neural network classifier, refining it through iterative fine-tuning. Experimental results on three different datasets demonstrate that our approach optimally leverages unlabeled data, outperforming existing methods in terms of accuracy for medical image recognition.
- [1481] arXiv:2404.10425 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Optimizing BioTac Simulation for Realistic Tactile PerceptionComments: 12 pages (including appendix), Accepted at the International Joint Conference on Neural Network (IJCNN) 2024, Yokohama, Japan. \c{opyright} 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media... (We refer to IEEE Copyrights)Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Tactile sensing presents a promising opportunity for enhancing the interaction capabilities of today's robots. BioTac is a commonly used tactile sensor that enables robots to perceive and respond to physical tactile stimuli. However, the sensor's non-linearity poses challenges in simulating its behavior. In this paper, we first investigate a BioTac simulation that uses temperature, force, and contact point positions to predict the sensor outputs. We show that training with BioTac temperature readings does not yield accurate sensor output predictions during deployment. Consequently, we tested three alternative models, i.e., an XGBoost regressor, a neural network, and a transformer encoder. We train these models without temperature readings and provide a detailed investigation of the window size of the input vectors. We demonstrate that we achieve statistically significant improvements over the baseline network. Furthermore, our results reveal that the XGBoost regressor and transformer outperform traditional feed-forward neural networks in this task. We make all our code and results available online on this https URL .
- [1482] arXiv:2404.10433 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Explainable concept mappings of MRI: Revealing the mechanisms underlying deep learning-based brain disease classificationChristian Tinauer , Anna Damulina , Maximilian Sackl , Martin Soellradl , Reduan Achtibat , Maximilian Dreyer , Frederik Pahde , Sebastian Lapuschkin , Reinhold Schmidt , Stefan Ropele , Wojciech Samek , Christian LangkammerSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Motivation. While recent studies show high accuracy in the classification of Alzheimer's disease using deep neural networks, the underlying learned concepts have not been investigated.
Goals. To systematically identify changes in brain regions through concepts learned by the deep neural network for model validation.
Approach. Using quantitative R2* maps we separated Alzheimer's patients (n=117) from normal controls (n=219) by using a convolutional neural network and systematically investigated the learned concepts using Concept Relevance Propagation and compared these results to a conventional region of interest-based analysis.
Results. In line with established histological findings and the region of interest-based analyses, highly relevant concepts were primarily found in and adjacent to the basal ganglia.
Impact. The identification of concepts learned by deep neural networks for disease classification enables validation of the models and could potentially improve reliability. - [1483] arXiv:2404.10443 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: AGHINT: Attribute-Guided Representation Learning on Heterogeneous Information Networks with TransformerComments: 9 pages, 5 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Recently, heterogeneous graph neural networks (HGNNs) have achieved impressive success in representation learning by capturing long-range dependencies and heterogeneity at the node level. However, few existing studies have delved into the utilization of node attributes in heterogeneous information networks (HINs). In this paper, we investigate the impact of inter-node attribute disparities on HGNNs performance within the benchmark task, i.e., node classification, and empirically find that typical models exhibit significant performance decline when classifying nodes whose attributes markedly differ from their neighbors. To alleviate this issue, we propose a novel Attribute-Guided heterogeneous Information Networks representation learning model with Transformer (AGHINT), which allows a more effective aggregation of neighbor node information under the guidance of attributes. Specifically, AGHINT transcends the constraints of the original graph structure by directly integrating higher-order similar neighbor features into the learning process and modifies the message-passing mechanism between nodes based on their attribute disparities. Extensive experimental results on three real-world heterogeneous graph benchmarks with target node attributes demonstrate that AGHINT outperforms the state-of-the-art.
- [1484] arXiv:2404.10445 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: SparseDM: Toward Sparse Efficient Diffusion ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Diffusion models have been extensively used in data generation tasks and are recognized as one of the best generative models. However, their time-consuming deployment, long inference time, and requirements on large memory limit their application on mobile devices. In this paper, we propose a method based on the improved Straight-Through Estimator to improve the deployment efficiency of diffusion models. Specifically, we add sparse masks to the Convolution and Linear layers in a pre-trained diffusion model, then use design progressive sparsity for model training in the fine-tuning stage, and switch the inference mask on and off, which supports a flexible choice of sparsity during inference according to the FID and MACs requirements. Experiments on four datasets conducted on a state-of-the-art Transformer-based diffusion model demonstrate that our method reduces MACs by $50\%$ while increasing FID by only 1.5 on average. Under other MACs conditions, the FID is also lower than 1$\sim$137 compared to other methods.
- [1485] arXiv:2404.10454 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A Computer Vision-Based Quality Assessment Technique for the automatic control of consumables for analytical laboratoriesComments: 31 pages, 13 figures, 10 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The rapid growth of the Industry 4.0 paradigm is increasing the pressure to develop effective automated monitoring systems. Artificial Intelligence (AI) is a convenient tool to improve the efficiency of industrial processes while reducing errors and waste. In fact, it allows the use of real-time data to increase the effectiveness of monitoring systems, minimize errors, make the production process more sustainable, and save costs. In this paper, a novel automatic monitoring system is proposed in the context of production process of plastic consumables used in analysis laboratories, with the aim to increase the effectiveness of the control process currently performed by a human operator. In particular, we considered the problem of classifying the presence or absence of a transparent anticoagulant substance inside test tubes. Specifically, a hand-designed deep network model is used and compared with some state-of-the-art models for its ability to categorize different images of vials that can be either filled with the anticoagulant or empty. Collected results indicate that the proposed approach is competitive with state-of-the-art models in terms of accuracy. Furthermore, we increased the complexity of the task by training the models on the ability to discriminate not only the presence or absence of the anticoagulant inside the vial, but also the size of the test tube. The analysis performed in the latter scenario confirms the competitiveness of our approach. Moreover, our model is remarkably superior in terms of its generalization ability and requires significantly fewer resources. These results suggest the possibility of successfully implementing such a model in the production process of a plastic consumables company.
- [1486] arXiv:2404.10458 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Advancing Long-Term Multi-Energy Load Forecasting with Patchformer: A Patch and Transformer-Based ApproachSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In the context of increasing demands for long-term multi-energy load forecasting in real-world applications, this paper introduces Patchformer, a novel model that integrates patch embedding with encoder-decoder Transformer-based architectures. To address the limitation in existing Transformer-based models, which struggle with intricate temporal patterns in long-term forecasting, Patchformer employs patch embedding, which predicts multivariate time-series data by separating it into multiple univariate data and segmenting each of them into multiple patches. This method effectively enhances the model's ability to capture local and global semantic dependencies. The numerical analysis shows that the Patchformer obtains overall better prediction accuracy in both multivariate and univariate long-term forecasting on the novel Multi-Energy dataset and other benchmark datasets. In addition, the positive effect of the interdependence among energy-related products on the performance of long-term time-series forecasting across Patchformer and other compared models is discovered, and the superiority of the Patchformer against other models is also demonstrated, which presents a significant advancement in handling the interdependence and complexities of long-term multi-energy forecasting. Lastly, Patchformer is illustrated as the only model that follows the positive correlation between model performance and the length of the past sequence, which states its ability to capture long-range past local semantic information.
- [1487] arXiv:2404.10464 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation FusionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Despite the remarkable achievements of language models (LMs) across a broad spectrum of tasks, their propensity for generating toxic outputs remains a prevalent concern. Current solutions involving fine-tuning or auxiliary models usually require extensive memory and computational resources, rendering them less practical for deployment in large language models (LLMs). In this paper, we propose DeStein, a novel method that detoxififies LMs by altering their internal representations in the activation space with lower resource and time cost. Specifically, we leverage self-induced steering pairs to identify detoxification vectors through arithmetic operations in the activation space. During inference, detoxification is achieved by blending the detoxification vectors with the original representations. Empirical results demonstrate that our method significantly outperforms previous state-of-the-art approaches on popular detoxification metrics, while also maintaining satisfactory generation quality and diversity. Furthermore, we extend our method to multiple LLMs, demonstrating its practicality and scalability. We open-source our method at this https URL . Warning: Some example model outputs contain highly offensive or disturbing text.
- [1488] arXiv:2404.10499 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Robust Noisy Label Learning via Two-Stream Sample DistillationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Noisy label learning aims to learn robust networks under the supervision of noisy labels, which plays a critical role in deep learning. Existing work either conducts sample selection or label correction to deal with noisy labels during the model training process. In this paper, we design a simple yet effective sample selection framework, termed Two-Stream Sample Distillation (TSSD), for noisy label learning, which can extract more high-quality samples with clean labels to improve the robustness of network training. Firstly, a novel Parallel Sample Division (PSD) module is designed to generate a certain training set with sufficient reliable positive and negative samples by jointly considering the sample structure in feature space and the human prior in loss space. Secondly, a novel Meta Sample Purification (MSP) module is further designed to mine adequate semi-hard samples from the remaining uncertain training set by learning a strong meta classifier with extra golden data. As a result, more and more high-quality samples will be distilled from the noisy training set to train networks robustly in every iteration. Extensive experiments on four benchmark datasets, including CIFAR-10, CIFAR-100, Tiny-ImageNet, and Clothing-1M, show that our method has achieved state-of-the-art results over its competitors.
- [1489] arXiv:2404.10500 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: When Emotional Stimuli meet Prompt Designing: An Auto-Prompt Graphical ParadigmComments: 9 pages, 5 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: With the development of Large Language Models (LLM), numerous prompts have been proposed, each with a rich set of features and their own merits. This paper summarizes the prompt words for large language models (LLMs), categorizing them into stimulating and framework types, and proposes an Auto-Prompt Graphical Paradigm(APGP) that combines both stimulating and framework prompts to enhance the problem-solving capabilities of LLMs across multiple domains, then exemplifies it with a framework that adheres to this paradigm. The framework involves automated prompt generation and consideration of emotion-stimulus factors, guiding LLMs in problem abstraction, diversified solutions generation, comprehensive optimization, and self-verification after providing answers, ensuring solution accuracy. Compared to traditional stimuli and framework prompts, this framework integrates the advantages of both by adopting automated approaches inspired by APE work, overcoming the limitations of manually designed prompts. Test results on the ruozhiba and BBH datasets demonstrate that this framework can effectively improve the efficiency and accuracy of LLMs in problem-solving, paving the way for new applications of LLMs.
- [1490] arXiv:2404.10501 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Self-Supervised Visual Preference AlignmentSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs). We generate chosen and rejected responses with regard to the original and augmented image pairs, and conduct preference alignment with direct preference optimization. It is based on a core idea: properly designed augmentation to the image input will induce VLM to generate false but hard negative responses, which helps the model to learn from and produce more robust and powerful answers. The whole pipeline no longer hinges on supervision from GPT4 or human involvement during alignment, and is highly efficient with few lines of code. With only 8k randomly sampled unsupervised data, it achieves 90\% relative score to GPT-4 on complex reasoning in LLaVA-Bench, and improves LLaVA-7B/13B by 6.7\%/5.6\% score on complex multi-modal benchmark MM-Vet. Visualizations shows its improved ability to align with user-intentions. A series of ablations are firmly conducted to reveal the latent mechanism of the approach, which also indicates its potential towards further scaling. Code will be available.
- [1491] arXiv:2404.10503 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Sentiment Analysis of Medical Text Based on Deep LearningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The field of natural language processing (NLP) has made significant progress with the rapid development of deep learning technologies. One of the research directions in text sentiment analysis is sentiment analysis of medical texts, which holds great potential for application in clinical diagnosis. However, the medical field currently lacks sufficient text datasets, and the effectiveness of sentiment analysis is greatly impacted by different model design approaches, which presents challenges. Therefore, this paper focuses on the medical domain, using bidirectional encoder representations from transformers (BERT) as the basic pre-trained model and experimenting with modules such as convolutional neural network (CNN), fully connected network (FCN), and graph convolutional networks (GCN) at the output layer. Experiments and analyses were conducted on the METS-CoV dataset to explore the training performance after integrating different deep learning networks. The results indicate that CNN models outperform other networks when trained on smaller medical text datasets in combination with pre-trained models like BERT. This study highlights the significance of model selection in achieving effective sentiment analysis in the medical domain and provides a reference for future research to develop more efficient model architectures.
- [1492] arXiv:2404.10508 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: White Men Lead, Black Women Help: Uncovering Gender, Racial, and Intersectional Bias in Language AgencySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Social biases can manifest in language agency. For instance, White individuals and men are often described as "agentic" and achievement-oriented, whereas Black individuals and women are frequently described as "communal" and as assisting roles. This study establishes agency as an important aspect of studying social biases in both human-written and Large Language Model (LLM)-generated texts. To accurately measure "language agency" at sentence level, we propose a Language Agency Classification dataset to train reliable agency classifiers. We then use an agency classifier to reveal notable language agency biases in 6 datasets of human- or LLM-written texts, including biographies, professor reviews, and reference letters. While most prior NLP research on agency biases focused on single dimensions, we comprehensively explore language agency biases in gender, race, and intersectional identities. We observe that (1) language agency biases in human-written texts align with real-world social observations; (2) LLM-generated texts demonstrate remarkably higher levels of language agency bias than human-written texts; and (3) critical biases in language agency target people of minority groups -- for instance, languages used to describe Black females exhibit the lowest level of agency across datasets. Our findings reveal intricate social biases in human- and LLM-written texts through the lens of language agency, warning against using LLM generations in social contexts without scrutiny.
- [1493] arXiv:2404.10513 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CoTAR: Chain-of-Thought Attribution Reasoning with Multi-level GranularitySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: State-of-the-art performance in QA tasks is currently achieved by systems employing Large Language Models (LLMs), however these models tend to hallucinate information in their responses. One approach focuses on enhancing the generation process by incorporating attribution from the given input to the output. However, the challenge of identifying appropriate attributions and verifying their accuracy against a source is a complex task that requires significant improvements in assessing such systems. We introduce an attribution-oriented Chain-of-Thought reasoning method to enhance the accuracy of attributions. This approach focuses the reasoning process on generating an attribution-centric output. Evaluations on two context-enhanced question-answering datasets using GPT-4 demonstrate improved accuracy and correctness of attributions. In addition, the combination of our method with finetuning enhances the response and attribution accuracy of two smaller LLMs, showing their potential to outperform GPT-4 in some cases.
- [1494] arXiv:2404.10534 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Into the Fog: Evaluating Multiple Object Tracking RobustnessSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: State-of-the-art (SOTA) trackers have shown remarkable Multiple Object Tracking (MOT) performance when trained and evaluated on current benchmarks. However, these benchmarks primarily consist of clear scenarios, overlooking adverse atmospheric conditions such as fog, haze, smoke and dust. As a result, the robustness of SOTA trackers remains underexplored. To address these limitations, we propose a pipeline for physic-based volumetric fog simulation in arbitrary real-world MOT dataset utilizing frame-by-frame monocular depth estimation and a fog formation optical model. Moreover, we enhance our simulation by rendering of both homogeneous and heterogeneous fog effects. We propose to use the dark channel prior method to estimate fog (smoke) color, which shows promising results even in night and indoor scenes. We present the leading tracking benchmark MOTChallenge (MOT17 dataset) overlaid by fog (smoke for indoor scenes) of various intensity levels and conduct a comprehensive evaluation of SOTA MOT methods, revealing their limitations under fog and fog-similar challenges.
- [1495] arXiv:2404.10539 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: VideoSAGE: Video Summarization with Graph Representation LearningComments: arXiv admin note: text overlap with arXiv:2207.07783Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We propose a graph-based representation learning framework for video summarization. First, we convert an input video to a graph where nodes correspond to each of the video frames. Then, we impose sparsity on the graph by connecting only those pairs of nodes that are within a specified temporal distance. We then formulate the video summarization task as a binary node classification problem, precisely classifying video frames whether they should belong to the output summary video. A graph constructed this way aims to capture long-range interactions among video frames, and the sparsity ensures the model trains without hitting the memory and compute bottleneck. Experiments on two datasets(SumMe and TVSum) demonstrate the effectiveness of the proposed nimble model compared to existing state-of-the-art summarization approaches while being one order of magnitude more efficient in compute time and memory
- [1496] arXiv:2404.10552 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Unveiling the Misuse Potential of Base Large Language Models via In-Context LearningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. This includes both base models, which are pre-trained on extensive datasets without alignment, and aligned models, deliberately designed to align with ethical standards and human values. Contrary to the prevalent assumption that the inherent instruction-following limitations of base LLMs serve as a safeguard against misuse, our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions. To systematically assess these risks, we introduce a novel set of risk evaluation metrics. Empirical results reveal that the outputs from base LLMs can exhibit risk levels on par with those of models fine-tuned for malicious purposes. This vulnerability, requiring neither specialized knowledge nor training, can be manipulated by almost anyone, highlighting the substantial risk and the critical need for immediate attention to the base LLMs' security protocols.
- [1497] arXiv:2404.10574 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Uncertainty-guided Open-Set Source-Free Unsupervised Domain Adaptation with Target-private Class SegregationMattia Litrico , Davide Talon , Sebastiano Battiato , Alessio Del Bue , Mario Valerio Giuffrida , Pietro MorerioSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Standard Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target but usually requires simultaneous access to both source and target data. Moreover, UDA approaches commonly assume that source and target domains share the same labels space. Yet, these two assumptions are hardly satisfied in real-world scenarios. This paper considers the more challenging Source-Free Open-set Domain Adaptation (SF-OSDA) setting, where both assumptions are dropped. We propose a novel approach for SF-OSDA that exploits the granularity of target-private categories by segregating their samples into multiple unknown classes. Starting from an initial clustering-based assignment, our method progressively improves the segregation of target-private samples by refining their pseudo-labels with the guide of an uncertainty-based sample selection module. Additionally, we propose a novel contrastive loss, named NL-InfoNCELoss, that, integrating negative learning into self-supervised contrastive learning, enhances the model robustness to noisy pseudo-labels. Extensive experiments on benchmark datasets demonstrate the superiority of the proposed method over existing approaches, establishing new state-of-the-art performance. Notably, additional analyses show that our method is able to learn the underlying semantics of novel classes, opening the possibility to perform novel class discovery.
- [1498] arXiv:2404.10575 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: EMC$^2$: Efficient MCMC Negative Sampling for Contrastive Learning with Global ConvergenceComments: 20 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Abstract: A key challenge in contrastive learning is to generate negative samples from a large sample set to contrast with positive samples, for learning better encoding of the data. These negative samples often follow a softmax distribution which are dynamically updated during the training process. However, sampling from this distribution is non-trivial due to the high computational costs in computing the partition function. In this paper, we propose an Efficient Markov Chain Monte Carlo negative sampling method for Contrastive learning (EMC$^2$). We follow the global contrastive learning loss as introduced in SogCLR, and propose EMC$^2$ which utilizes an adaptive Metropolis-Hastings subroutine to generate hardness-aware negative samples in an online fashion during the optimization. We prove that EMC$^2$ finds an $\mathcal{O}(1/\sqrt{T})$-stationary point of the global contrastive loss in $T$ iterations. Compared to prior works, EMC$^2$ is the first algorithm that exhibits global convergence (to stationarity) regardless of the choice of batch size while exhibiting low computation and memory cost. Numerical experiments validate that EMC$^2$ is effective with small batch training and achieves comparable or better performance than baseline algorithms. We report the results for pre-training image encoders on STL-10 and Imagenet-100.
- [1499] arXiv:2404.10591 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Learning Symbolic Task Representation from a Human-Led Demonstration: A Memory to Store, Retrieve, Consolidate, and Forget ExperiencesSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Logic in Computer Science (cs.LO)
Abstract: We present a symbolic learning framework inspired by cognitive-like memory functionalities (i.e., storing, retrieving, consolidating and forgetting) to generate task representations to support high-level task planning and knowledge bootstrapping. We address a scenario involving a non-expert human, who performs a single task demonstration, and a robot, which online learns structured knowledge to re-execute the task based on experiences, i.e., observations. We consider a one-shot learning process based on non-annotated data to store an intelligible representation of the task, which can be refined through interaction, e.g., via verbal or visual communication. Our general-purpose framework relies on fuzzy Description Logic, which has been used to extend the previously developed Scene Identification and Tagging algorithm. In this paper, we exploit such an algorithm to implement cognitive-like memory functionalities employing scores that rank memorised observations over time based on simple heuristics. Our main contribution is the formalisation of a framework that can be used to systematically investigate different heuristics for bootstrapping hierarchical knowledge representations based on robot observations. Through an illustrative assembly task scenario, the paper presents the performance of our framework to discuss its benefits and limitations.
- [1500] arXiv:2404.10597 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Hardware-aware training of models with synaptic delays for digital event-driven neuromorphic processorsAlberto Patino-Saucedo , Roy Meijer , Amirreza Yousefzadeh , Manil-Dev Gomony , Federico Corradi , Paul Detteter , Laura Garrido-Regife , Bernabe Linares-Barranco , Manolis SifalakisSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Abstract: Configurable synaptic delays are a basic feature in many neuromorphic neural network hardware accelerators. However, they have been rarely used in model implementations, despite their promising impact on performance and efficiency in tasks that exhibit complex (temporal) dynamics, as it has been unclear how to optimize them. In this work, we propose a framework to train and deploy, in digital neuromorphic hardware, highly performing spiking neural network models (SNNs) where apart from the synaptic weights, the per-synapse delays are also co-optimized. Leveraging spike-based back-propagation-through-time, the training accounts for both platform constraints, such as synaptic weight precision and the total number of parameters per core, as a function of the network size. In addition, a delay pruning technique is used to reduce memory footprint with a low cost in performance. We evaluate trained models in two neuromorphic digital hardware platforms: Intel Loihi and Imec Seneca. Loihi offers synaptic delay support using the so-called Ring-Buffer hardware structure. Seneca does not provide native hardware support for synaptic delays. A second contribution of this paper is therefore a novel area- and memory-efficient hardware structure for acceleration of synaptic delays, which we have integrated in Seneca. The evaluated benchmark involves several models for solving the SHD (Spiking Heidelberg Digits) classification task, where minimal accuracy degradation during the transition from software to hardware is demonstrated. To our knowledge, this is the first work showcasing how to train and deploy hardware-aware models parameterized with synaptic delays, on multicore neuromorphic hardware accelerators.
- [1501] arXiv:2404.10606 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: InfoCon: Concept Discovery with Generative and Discriminative InformativenessComments: 27 pages, 15 figures. Published as a conference paper at ICLR 2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We focus on the self-supervised discovery of manipulation concepts that can be adapted and reassembled to address various robotic tasks. We propose that the decision to conceptualize a physical procedure should not depend on how we name it (semantics) but rather on the significance of the informativeness in its representation regarding the low-level physical state and state changes. We model manipulation concepts (discrete symbols) as generative and discriminative goals and derive metrics that can autonomously link them to meaningful sub-trajectories from noisy, unlabeled demonstrations. Specifically, we employ a trainable codebook containing encodings (concepts) capable of synthesizing the end-state of a sub-trajectory given the current state (generative informativeness). Moreover, the encoding corresponding to a particular sub-trajectory should differentiate the state within and outside it and confidently predict the subsequent action based on the gradient of its discriminative score (discriminative informativeness). These metrics, which do not rely on human annotation, can be seamlessly integrated into a VQ-VAE framework, enabling the partitioning of demonstrations into semantically consistent sub-trajectories, fulfilling the purpose of discovering manipulation concepts and the corresponding sub-goal (key) states. We evaluate the effectiveness of the learned concepts by training policies that utilize them as guidance, demonstrating superior performance compared to other baselines. Additionally, our discovered manipulation concepts compare favorably to human-annotated ones while saving much manual effort.
- [1502] arXiv:2404.10636 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: What are human values, and how do we align AI to them?Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: There is an emerging consensus that we need to align AI systems with human values (Gabriel, 2020; Ji et al., 2024), but it remains unclear how to apply this to language models in practice. We split the problem of "aligning to human values" into three parts: first, eliciting values from people; second, reconciling those values into an alignment target for training ML models; and third, actually training the model. In this paper, we focus on the first two parts, and ask the question: what are "good" ways to synthesize diverse human inputs about values into a target for aligning language models? To answer this question, we first define a set of 6 criteria that we believe must be satisfied for an alignment target to shape model behavior in accordance with human values. We then propose a process for eliciting and reconciling values called Moral Graph Elicitation (MGE), which uses a large language model to interview participants about their values in particular contexts; our approach is inspired by the philosophy of values advanced by Taylor (1977), Chang (2004), and others. We trial MGE with a representative sample of 500 Americans, on 3 intentionally divisive prompts (e.g. advice about abortion). Our results demonstrate that MGE is promising for improving model alignment across all 6 criteria. For example, almost all participants (89.1%) felt well represented by the process, and (89%) thought the final moral graph was fair, even if their value wasn't voted as the wisest. Our process often results in "expert" values (e.g. values from women who have solicited abortion advice) rising to the top of the moral graph, without defining who is considered an expert in advance.
- [1503] arXiv:2404.10645 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Continuous Control Reinforcement Learning: Distributed Distributional DrQ AlgorithmsComments: 11 pages, 12 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Distributed Distributional DrQ is a model-free and off-policy RL algorithm for continuous control tasks based on the state and observation of the agent, which is an actor-critic method with the data-augmentation and the distributional perspective of critic value function. Aim to learn to control the agent and master some tasks in a high-dimensional continuous space. DrQ-v2 uses DDPG as the backbone and achieves out-performance in various continuous control tasks. Here Distributed Distributional DrQ uses Distributed Distributional DDPG as the backbone, and this modification aims to achieve better performance in some hard continuous control tasks through the better expression ability of distributional value function and distributed actor policies.
- [1504] arXiv:2404.10662 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Continual Offline Reinforcement Learning via Diffusion-based Dual Generative ReplaySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We study continual offline reinforcement learning, a practical paradigm that facilitates forward transfer and mitigates catastrophic forgetting to tackle sequential offline tasks. We propose a dual generative replay framework that retains previous knowledge by concurrent replay of generated pseudo-data. First, we decouple the continual learning policy into a diffusion-based generative behavior model and a multi-head action evaluation model, allowing the policy to inherit distributional expressivity for encompassing a progressive range of diverse behaviors. Second, we train a task-conditioned diffusion model to mimic state distributions of past tasks. Generated states are paired with corresponding responses from the behavior generator to represent old tasks with high-fidelity replayed samples. Finally, by interleaving pseudo samples with real ones of the new task, we continually update the state and behavior generators to model progressively diverse behaviors, and regularize the multi-head critic via behavior cloning to mitigate forgetting. Experiments demonstrate that our method achieves better forward transfer with less forgetting, and closely approximates the results of using previous ground-truth data due to its high-fidelity replay of the sample space. Our code is available at \href{ this https URL }{ this https URL }.
- [1505] arXiv:2404.10679 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: HSVI-based Online Minimax Strategies for Partially Observable Stochastic Games with Neural Perception MechanismsComments: 12 pages, 2 figuresSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI)
Abstract: We consider a variant of continuous-state partially-observable stochastic games with neural perception mechanisms and an asymmetric information structure. One agent has partial information, with the observation function implemented as a neural network, while the other agent is assumed to have full knowledge of the state. We present, for the first time, an efficient online method to compute an $\varepsilon$-minimax strategy profile, which requires only one linear program to be solved for each agent at every stage, instead of a complex estimation of opponent counterfactual values. For the partially-informed agent, we propose a continual resolving approach which uses lower bounds, pre-computed offline with heuristic search value iteration (HSVI), instead of opponent counterfactual values. This inherits the soundness of continual resolving at the cost of pre-computing the bound. For the fully-informed agent, we propose an inferred-belief strategy, where the agent maintains an inferred belief about the belief of the partially-informed agent based on (offline) upper bounds from HSVI, guaranteeing $\varepsilon$-distance to the value of the game at the initial belief known to both agents.
- [1506] arXiv:2404.10704 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Question Difficulty Ranking for Multiple-Choice Reading ComprehensionComments: 7 pages, 3 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Multiple-choice (MC) tests are an efficient method to assess English learners. It is useful for test creators to rank candidate MC questions by difficulty during exam curation. Typically, the difficulty is determined by having human test takers trial the questions in a pretesting stage. However, this is expensive and not scalable. Therefore, we explore automated approaches to rank MC questions by difficulty. However, there is limited data for explicit training of a system for difficulty scores. Hence, we compare task transfer and zero-shot approaches: task transfer adapts level classification and reading comprehension systems for difficulty ranking while zero-shot prompting of instruction finetuned language models contrasts absolute assessment against comparative. It is found that level classification transfers better than reading comprehension. Additionally, zero-shot comparative assessment is more effective at difficulty ranking than the absolute assessment and even the task transfer approaches at question difficulty ranking with a Spearman's correlation of 40.4%. Combining the systems is observed to further boost the correlation.
- [1507] arXiv:2404.10717 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Mixed Prototype Consistency Learning for Semi-supervised Medical Image SegmentationComments: 15 pages, 2 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recently, prototype learning has emerged in semi-supervised medical image segmentation and achieved remarkable performance. However, the scarcity of labeled data limits the expressiveness of prototypes in previous methods, potentially hindering the complete representation of prototypes for class embedding. To address this problem, we propose the Mixed Prototype Consistency Learning (MPCL) framework, which includes a Mean Teacher and an auxiliary network. The Mean Teacher generates prototypes for labeled and unlabeled data, while the auxiliary network produces additional prototypes for mixed data processed by CutMix. Through prototype fusion, mixed prototypes provide extra semantic information to both labeled and unlabeled prototypes. High-quality global prototypes for each class are formed by fusing two enhanced prototypes, optimizing the distribution of hidden embeddings used in consistency learning. Extensive experiments on the left atrium and type B aortic dissection datasets demonstrate MPCL's superiority over previous state-of-the-art approaches, confirming the effectiveness of our framework. The code will be released soon.
- [1508] arXiv:2404.10730 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Insight Gained from Migrating a Machine Learning Model to Intelligence Processing UnitsHieu Le , Zhenhua He , Mai Le , Dhruva K. Chakravorty , Lisa M. Perez , Akhil Chilumuru , Yan Yao , Jiefu ChenComments: arXiv admin note: This version has been removed by arXiv administrators as the submitter did not have the right to agree to the license at the time of submissionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The discoveries in this paper show that Intelligence Processing Units (IPUs) offer a viable accelerator alternative to GPUs for machine learning (ML) applications within the fields of materials science and battery research. We investigate the process of migrating a model from GPU to IPU and explore several optimization techniques, including pipelining and gradient accumulation, aimed at enhancing the performance of IPU-based models. Furthermore, we have effectively migrated a specialized model to the IPU platform. This model is employed for predicting effective conductivity, a parameter crucial in ion transport processes, which govern the performance of multiple charge and discharge cycles of batteries. The model utilizes a Convolutional Neural Network (CNN) architecture to perform prediction tasks for effective conductivity. The performance of this model on the IPU is found to be comparable to its execution on GPUs. We also analyze the utilization and performance of Graphcore's Bow IPU. Through benchmark tests, we observe significantly improved performance with the Bow IPU when compared to its predecessor, the Colossus IPU.
- [1509] arXiv:2404.10774 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MiniCheck: Efficient Fact-Checking of LLMs on Grounding DocumentsComments: LLM-AggreFact benchmark, MiniCheck models, data generation code at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of "fact-checking" are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to LLMs to check a single response. In this work, we show how to build small models that have GPT-4-level performance but for 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify pre-existing datasets into a benchmark LLM-AggreFact, collected from recent work on fact-checking and grounding LLM generations. Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.
- [1510] arXiv:2404.10775 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: COMBO: Compositional World Models for Embodied Multi-Agent CooperationHongxin Zhang , Zeyuan Wang , Qiushi Lyu , Zheyuan Zhang , Sunli Chen , Tianmin Shu , Yilun Du , Chuang GanComments: 23 pages. The first three authors contributed equallySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only partial egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on an arbitrary number of agents' actions given only partial egocentric visual observations of the world. To address this issue of partial observability, we first train generative models to estimate the overall world state given partial egocentric observations. To enable accurate simulation of multiple sets of actions on this world state, we then propose to learn a compositional world model for multi-agent cooperation by factorizing the naturally composable joint actions of multiple agents and compositionally generating the video. By leveraging this compositional world model, in combination with Vision Language Models to infer the actions of other agents, we can use a tree search procedure to integrate these modules and facilitate online cooperative planning. To evaluate the efficacy of our methods, we create two challenging embodied multi-agent long-horizon cooperation tasks using the ThreeDWorld simulator and conduct experiments with 2-4 agents. The results show our compositional world model is effective and the framework enables the embodied agents to cooperate efficiently with different agents across various tasks and an arbitrary number of agents, showing the promising future of our proposed framework. More videos can be found at this https URL .
- [1511] arXiv:2404.10782 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Quantifying AI Vulnerabilities: A Synthesis of Complexity, Dynamical Systems, and Game TheoryComments: 18 pagesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: The rapid integration of Artificial Intelligence (AI) systems across critical domains necessitates robust security evaluation frameworks. We propose a novel approach that introduces three metrics: System Complexity Index (SCI), Lyapunov Exponent for AI Stability (LEAIS), and Nash Equilibrium Robustness (NER). SCI quantifies the inherent complexity of an AI system, LEAIS captures its stability and sensitivity to perturbations, and NER evaluates its strategic robustness against adversarial manipulation. Through comparative analysis, we demonstrate the advantages of our framework over existing techniques. We discuss the theoretical and practical implications, potential applications, limitations, and future research directions. Our work contributes to the development of secure and trustworthy AI technologies by providing a holistic, theoretically grounded approach to AI security evaluation. As AI continues to advance, prioritising and advancing AI security through interdisciplinary collaboration is crucial to ensure its responsible deployment for the benefit of society.
- [1512] arXiv:2404.10786 (cross-list from cs.DC) [ pdf , ps , other ]
-
Title: Sustainability of Data Center Digital Twins with Reinforcement LearningSoumyendu Sarkar , Avisek Naug , Antonio Guillen , Ricardo Luna , Vineet Gundecha , Ashwin Ramesh Babu , Sajad MousaviComments: 2024 Proceedings of the AAAI Conference on Artificial IntelligenceJournal-ref: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 20, pp. 22322-22330, Mar. 2024Subjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Abstract: The rapid growth of machine learning (ML) has led to an increased demand for computational power, resulting in larger data centers (DCs) and higher energy consumption. To address this issue and reduce carbon emissions, intelligent design and control of DC components such as IT servers, cabinets, HVAC cooling, flexible load shifting, and battery energy storage are essential. However, the complexity of designing and controlling them in tandem presents a significant challenge. While some individual components like CFD-based design and Reinforcement Learning (RL) based HVAC control have been researched, there's a gap in the holistic design and optimization covering all elements simultaneously. To tackle this, we've developed DCRL-Green, a multi-agent RL environment that empowers the ML community to design data centers and research, develop, and refine RL controllers for carbon footprint reduction in DCs. It is a flexible, modular, scalable, and configurable platform that can handle large High Performance Computing (HPC) clusters. Furthermore, in its default setup, DCRL-Green provides a benchmark for evaluating single as well as multi-agent RL algorithms. It easily allows users to subclass the default implementations and design their own control approaches, encouraging community development for sustainable data centers. Open Source Link: this https URL
- [1513] arXiv:2404.10788 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: The Path To Autonomous Cyber DefenseSean Oesch , Phillipe Austria , Amul Chaulagain , Brian Weber , Cory Watson , Matthew Dixson , Amir SadovnikComments: 9 pages, 3 figuresSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Defenders are overwhelmed by the number and scale of attacks against their networks.This problem will only be exacerbated as attackers leverage artificial intelligence to automate their workflows. We propose a path to autonomous cyber agents able to augment defenders by automating critical steps in the cyber defense life cycle.
- [1514] arXiv:2404.10789 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: PASA: Attack Agnostic Unsupervised Adversarial Detection using Prediction & Attribution Sensitivity AnalysisDipkamal Bhusal , Md Tanvirul Alam , Monish K. Veerabhadran , Michael Clifford , Sara Rampazzi , Nidhi RastogiComments: 9th IEEE European Symposium on Security and PrivacySubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Deep neural networks for classification are vulnerable to adversarial attacks, where small perturbations to input samples lead to incorrect predictions. This susceptibility, combined with the black-box nature of such networks, limits their adoption in critical applications like autonomous driving. Feature-attribution-based explanation methods provide relevance of input features for model predictions on input samples, thus explaining model decisions. However, we observe that both model predictions and feature attributions for input samples are sensitive to noise. We develop a practical method for this characteristic of model prediction and feature attribution to detect adversarial samples. Our method, PASA, requires the computation of two test statistics using model prediction and feature attribution and can reliably detect adversarial samples using thresholds learned from benign samples. We validate our lightweight approach by evaluating the performance of PASA on varying strengths of FGSM, PGD, BIM, and CW attacks on multiple image and non-image datasets. On average, we outperform state-of-the-art statistical unsupervised adversarial detectors on CIFAR-10 and ImageNet by 14\% and 35\% ROC-AUC scores, respectively. Moreover, our approach demonstrates competitive performance even when an adversary is aware of the defense mechanism.
- [1515] arXiv:2404.10790 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Multimodal Attack Detection for Action Recognition ModelsSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Adversarial machine learning attacks on video action recognition models is a growing research area and many effective attacks were introduced in recent years. These attacks show that action recognition models can be breached in many ways. Hence using these models in practice raises significant security concerns. However, there are very few works which focus on defending against or detecting attacks. In this work, we propose a novel universal detection method which is compatible with any action recognition model. In our extensive experiments, we show that our method consistently detects various attacks against different target models with high true positive rates while satisfying very low false positive rates. Tested against four state-of-the-art attacks targeting four action recognition models, the proposed detector achieves an average AUC of 0.911 over 16 test cases while the best performance achieved by the existing detectors is 0.645 average AUC. This 41.2% improvement is enabled by the robustness of the proposed detector to varying attack methods and target models. The lowest AUC achieved by our detector across the 16 test cases is 0.837 while the competing detector's performance drops as low as 0.211. We also show that the proposed detector is robust to varying attack strengths. In addition, we analyze our method's real-time performance with different hardware setups to demonstrate its potential as a practical defense mechanism.
- [1516] arXiv:2404.10795 (cross-list from cs.SI) [ pdf , ps , other ]
-
Title: Intelligent Message Behavioral Identification SystemSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Abstract: On social media platforms, the act of predicting reposting is seen as a challenging issue related to Short Message Services (SMS). This study examines the issue of predicting picture reposting in SMS and forecasts users' behavior in sharing photographs on Twitter. Several research vary. The paper introduces a network called Image Retweet Modeling (IRM) that models heterogeneous image retransmission. It considers the user's previous reposting of the image tweet, the next contact in the SMS, and the preferences of the reposted person. Three aspects connected to content. A text-guided multimodal neural network is developed to create a novel multi-faceted attention ranking network methodology. This allows for learning the joint image Twitter representation and user preference representation in the prediction job. Multiple experiments conducted on extensive data sets demonstrate that our approach outperforms current methods on Social Network platforms.
- [1517] arXiv:2404.10800 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Integrating Graph Neural Networks with Scattering Transform for Anomaly DetectionSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In this paper, we present two novel methods in Network Intrusion Detection Systems (NIDS) using Graph Neural Networks (GNNs). The first approach, Scattering Transform with E-GraphSAGE (STEG), utilizes the scattering transform to conduct multi-resolution analysis of edge feature vectors. This provides a detailed representation that is essential for identifying subtle anomalies in network traffic. The second approach improves node representation by initiating with Node2Vec, diverging from standard methods of using uniform values, thereby capturing a more accurate and holistic network picture. Our methods have shown significant improvements in performance compared to existing state-of-the-art methods in benchmark NIDS datasets.
- [1518] arXiv:2404.10824 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Decoupled Weight Decay for Any $p$ NormComments: GitHub link: this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
Abstract: With the success of deep neural networks (NNs) in a variety of domains, the computational and storage requirements for training and deploying large NNs have become a bottleneck for further improvements. Sparsification has consequently emerged as a leading approach to tackle these issues. In this work, we consider a simple yet effective approach to sparsification, based on the Bridge, or $L_p$ regularization during training. We introduce a novel weight decay scheme, which generalizes the standard $L_2$ weight decay to any $p$ norm. We show that this scheme is compatible with adaptive optimizers, and avoids the gradient divergence associated with $0<p<1$ norms. We empirically demonstrate that it leads to highly sparse networks, while maintaining generalization performance comparable to standard $L_2$ regularization.
- [1519] arXiv:2404.10830 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Fewer Truncations Improve Language ModelingHantian Ding , Zijian Wang , Giovanni Paolini , Varun Kumar , Anoop Deoras , Dan Roth , Stefano SoattoComments: ICML 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content that is grounded on the complete context. To address the issue, we propose Best-fit Packing, a scalable and efficient method that packs documents into training sequences through length-aware combinatorial optimization. Our method completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation. Empirical results from both text and code pre-training show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension; +16.8% in context following; and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%.
- [1520] arXiv:2404.10843 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Geometric Neural Operators (GNPs) for Data-Driven Deep Learning of Non-Euclidean OperatorsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract: We introduce Geometric Neural Operators (GNPs) for accounting for geometric contributions in data-driven deep learning of operators. We show how GNPs can be used (i) to estimate geometric properties, such as the metric and curvatures, (ii) to approximate Partial Differential Equations (PDEs) on manifolds, (iii) learn solution maps for Laplace-Beltrami (LB) operators, and (iv) to solve Bayesian inverse problems for identifying manifold shapes. The methods allow for handling geometries of general shape including point-cloud representations. The developed GNPs provide approaches for incorporating the roles of geometry in data-driven learning of operators.
- [1521] arXiv:2404.10848 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A LayoutLMv3-Based Model for Enhanced Relation Extraction in Visually-Rich DocumentsWiam Adnan , Joel Tang , Yassine Bel Khayat Zouggari , Seif Edinne Laatiri , Laurent Lam , Fabien CaspaniComments: Accepted at the International Conference on Document Analysis and Recognition (ICDAR 2024)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Document Understanding is an evolving field in Natural Language Processing (NLP). In particular, visual and spatial features are essential in addition to the raw text itself and hence, several multimodal models were developed in the field of Visual Document Understanding (VDU). However, while research is mainly focused on Key Information Extraction (KIE), Relation Extraction (RE) between identified entities is still under-studied. For instance, RE is crucial to regroup entities or obtain a comprehensive hierarchy of data in a document. In this paper, we present a model that, initialized from LayoutLMv3, can match or outperform the current state-of-the-art results in RE applied to Visually-Rich Documents (VRD) on FUNSD and CORD datasets, without any specific pre-training and with fewer parameters. We also report an extensive ablation study performed on FUNSD, highlighting the great impact of certain features and modelization choices on the performances.
- [1522] arXiv:2404.10849 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: End-To-End Training and Testing Gamification Framework to Learn Human Highway DrivingComments: Accepted by ITSCSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: The current autonomous stack is well modularized and consists of perception, decision making and control in a handcrafted framework. With the advances in artificial intelligence (AI) and computing resources, researchers have been pushing the development of end-to-end AI for autonomous driving, at least in problems of small searching space such as in highway scenarios, and more and more photorealistic simulation will be critical for efficient learning. In this research, we propose a novel game-based end-to-end learning and testing framework for autonomous vehicle highway driving, by learning from human driving skills. Firstly, we utilize the popular game Grand Theft Auto V (GTA V) to collect highway driving data with our proposed programmable labels. Then, an end-to-end architecture predicts the steering and throttle values that control the vehicle by the image of the game screen. The predicted control values are sent to the game via a virtual controller to keep the vehicle in lane and avoid collisions with other vehicles on the road. The proposed solution is validated in GTA V games, and the results demonstrate the effectiveness of this end-to-end gamification framework for learning human driving skills.
- [1523] arXiv:2404.10880 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: HumMUSS: Human Motion Understanding using State Space ModelsComments: CVPR 24Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Understanding human motion from video is essential for a range of applications, including pose estimation, mesh recovery and action recognition. While state-of-the-art methods predominantly rely on transformer-based architectures, these approaches have limitations in practical scenarios. Transformers are slower when sequentially predicting on a continuous stream of frames in real-time, and do not generalize to new frame rates. In light of these constraints, we propose a novel attention-free spatiotemporal model for human motion understanding building upon recent advancements in state space models. Our model not only matches the performance of transformer-based models in various motion understanding tasks but also brings added benefits like adaptability to different video frame rates and enhanced training speed when working with longer sequence of keypoints. Moreover, the proposed model supports both offline and real-time applications. For real-time sequential prediction, our model is both memory efficient and several times faster than transformer-based approaches while maintaining their high accuracy.
- [1524] arXiv:2404.10896 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: From a Lossless (~1.5:1) Compression Algorithm for Llama2 7B Weights to Variable Precision, Variable Range, Compressed Numeric Data Types for CNNs and LLMsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Abstract: This paper starts with a simple lossless ~1.5:1 compression algorithm for the weights of the Large Language Model (LLM) Llama2 7B [1] that can be implemented in ~200 LUTs in AMD FPGAs, processing over 800 million bfloat16 numbers per second. This framework is then extended to variable precision, variable range, compressed numerical data types that are a user defined super set of both floats and posits [2]. The paper then discusses a simple hardware implementation of such format based on ANS (Asymmetrical Numeral Systems) [3] that acts as a bridge between this flexible data format and a computational engine while, at the same time, achieving bandwidth reduction. An example of a token factory using weight compression and sharing is also given.
- [1525] arXiv:2404.10924 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Binder: Hierarchical Concept Representation through Order Embedding of Binary VectorsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: For natural language understanding and generation, embedding concepts using an order-based representation is an essential task. Unlike traditional point vector based representation, an order-based representation imposes geometric constraints on the representation vectors for explicitly capturing various semantic relationships that may exist between a pair of concepts. In existing literature, several approaches on order-based embedding have been proposed, mostly focusing on capturing hierarchical relationships; examples include vectors in Euclidean space, complex, Hyperbolic, order, and Box Embedding. Box embedding creates region-based rich representation of concepts, but along the process it sacrifices simplicity, requiring a custom-made optimization scheme for learning the representation. Hyperbolic embedding improves embedding quality by exploiting the ever-expanding property of Hyperbolic space, but it also suffers from the same fate as box embedding as gradient descent like optimization is not simple in the Hyperbolic space. In this work, we propose Binder, a novel approach for order-based representation. Binder uses binary vectors for embedding, so the embedding vectors are compact with an order of magnitude smaller footprint than other methods. Binder uses a simple and efficient optimization scheme for learning representation vectors with a linear time complexity. Our comprehensive experimental results show that Binder is very accurate, yielding competitive results on the representation task. But Binder stands out from its competitors on the transitive closure link prediction task as it can learn concept embeddings just from the direct edges, whereas all existing order-based approaches rely on the indirect edges.
- [1526] arXiv:2404.10934 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Shears: Unstructured Sparsity with Neural Low-rank Adapter SearchComments: 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Industry Track)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Recently, several approaches successfully demonstrated that weight-sharing Neural Architecture Search (NAS) can effectively explore a search space of elastic low-rank adapters (LoRA), allowing the parameter-efficient fine-tuning (PEFT) and compression of large language models. In this paper, we introduce a novel approach called Shears, demonstrating how the integration of cost-effective sparsity and a proposed Neural Low-rank adapter Search (NLS) algorithm can further improve the efficiency of PEFT approaches. Results demonstrate the benefits of Shears compared to other methods, reaching high sparsity levels while improving or with little drop in accuracy, utilizing a single GPU for a pair of hours.
- [1527] arXiv:2404.10942 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: What Hides behind Unfairness? Exploring Dynamics Fairness in Reinforcement LearningComments: 13 pages, 9 figures, accepted by IJCAI 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Methodology (stat.ME)
Abstract: In sequential decision-making problems involving sensitive attributes like race and gender, reinforcement learning (RL) agents must carefully consider long-term fairness while maximizing returns. Recent works have proposed many different types of fairness notions, but how unfairness arises in RL problems remains unclear. In this paper, we address this gap in the literature by investigating the sources of inequality through a causal lens. We first analyse the causal relationships governing the data generation process and decompose the effect of sensitive attributes on long-term well-being into distinct components. We then introduce a novel notion called dynamics fairness, which explicitly captures the inequality stemming from environmental dynamics, distinguishing it from those induced by decision-making or inherited from the past. This notion requires evaluating the expected changes in the next state and the reward induced by changing the value of the sensitive attribute while holding everything else constant. To quantitatively evaluate this counterfactual concept, we derive identification formulas that allow us to obtain reliable estimations from data. Extensive experiments demonstrate the effectiveness of the proposed techniques in explaining, detecting, and reducing inequality in reinforcement learning. We publicly release code at this https URL .
- [1528] arXiv:2404.10946 (cross-list from q-bio.NC) [ pdf , ps , html , other ]
-
Title: Information encoding and decoding in in-vitro neural networks on micro electrode arrays through stimulation timingComments: 50 pages, 23 figuresSubjects: Neurons and Cognition (q-bio.NC) ; Artificial Intelligence (cs.AI)
Abstract: A primary challenge in utilizing in-vitro biological neural networks for computations is finding good encoding and decoding schemes for inputting and decoding data to and from the networks. Furthermore, identifying the optimal parameter settings for a given combination of encoding and decoding schemes adds additional complexity to this challenge. In this study we explore stimulation timing as an encoding method, i.e. we encode information as the delay between stimulation pulses and identify the bounds and acuity of stimulation timings which produce linearly separable spike responses. We also examine the optimal readout parameters for a linear decoder in the form of epoch length, time bin size and epoch offset. Our results suggest that stimulation timings between 36 and 436ms may be optimal for encoding and that different combinations of readout parameters may be optimal at different parts of the evoked spike response.
- [1529] arXiv:2404.10952 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Can Language Models Solve Olympiad Programming?Comments: Code and data: this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Abstract: Computing olympiads contain some of the most challenging problems for humans, requiring complex algorithmic reasoning, puzzle solving, in addition to generating efficient code. However, it has been understudied as a domain to evaluate language models (LMs). In this paper, we introduce the USACO benchmark with 307 problems from the USA Computing Olympiad, along with high-quality unit tests, reference code, and official analyses for each problem. These resources enable us to construct and test a range of LM inference methods for competitive programming for the first time. We find GPT-4 only achieves a 8.7% pass@1 accuracy with zero-shot chain-of-thought prompting, and our best inference method improves it to 20.2% using a combination of self-reflection and retrieval over episodic knowledge. However, this is far from solving the benchmark. To better understand the remaining challenges, we design a novel human-in-the-loop study and surprisingly find that a small number of targeted hints enable GPT-4 to solve 13 out of 15 problems previously unsolvable by any model and method. Our benchmark, baseline methods, quantitative results, and qualitative analysis serve as an initial step toward LMs with grounded, creative, and algorithmic reasoning.
- [1530] arXiv:2404.10960 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Uncertainty-Based Abstention in LLMs Improves Safety and Reduces HallucinationsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: A major barrier towards the practical deployment of large language models (LLMs) is their lack of reliability. Three situations where this is particularly apparent are correctness, hallucinations when given unanswerable questions, and safety. In all three cases, models should ideally abstain from responding, much like humans, whose ability to understand uncertainty makes us refrain from answering questions we don't know. Inspired by analogous approaches in classification, this study explores the feasibility and efficacy of abstaining while uncertain in the context of LLMs within the domain of question-answering. We investigate two kinds of uncertainties, statistical uncertainty metrics and a distinct verbalized measure, termed as In-Dialogue Uncertainty (InDU). Using these uncertainty measures combined with models with and without Reinforcement Learning with Human Feedback (RLHF), we show that in all three situations, abstention based on the right kind of uncertainty measure can boost the reliability of LLMs. By sacrificing only a few highly uncertain samples we can improve correctness by 2% to 8%, avoid 50% hallucinations via correctly identifying unanswerable questions and increase safety by 70% up to 99% with almost no additional computational overhead.
- [1531] arXiv:2404.10976 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Group-Aware Coordination Graph for Multi-Agent Reinforcement LearningComments: Accepted by IJCAI 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: Cooperative Multi-Agent Reinforcement Learning (MARL) necessitates seamless collaboration among agents, often represented by an underlying relation graph. Existing methods for learning this graph primarily focus on agent-pair relations, neglecting higher-order relationships. While several approaches attempt to extend cooperation modelling to encompass behaviour similarities within groups, they commonly fall short in concurrently learning the latent graph, thereby constraining the information exchange among partially observed agents. To overcome these limitations, we present a novel approach to infer the Group-Aware Coordination Graph (GACG), which is designed to capture both the cooperation between agent pairs based on current observations and group-level dependencies from behaviour patterns observed across trajectories. This graph is further used in graph convolution for information exchange between agents during decision-making. To further ensure behavioural consistency among agents within the same group, we introduce a group distance loss, which promotes group cohesion and encourages specialization between groups. Our evaluations, conducted on StarCraft II micromanagement tasks, demonstrate GACG's superior performance. An ablation study further provides experimental evidence of the effectiveness of each component of our method.
- [1532] arXiv:2404.10978 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Leveraging 3D LiDAR Sensors to Enable Enhanced Urban Safety and Public Health: Pedestrian Monitoring and Abnormal Activity DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The integration of Light Detection and Ranging (LiDAR) and Internet of Things (IoT) technologies offers transformative opportunities for public health informatics in urban safety and pedestrian well-being. This paper proposes a novel framework utilizing these technologies for enhanced 3D object detection and activity classification in urban traffic scenarios. By employing elevated LiDAR, we obtain detailed 3D point cloud data, enabling precise pedestrian activity monitoring. To overcome urban data scarcity, we create a specialized dataset through simulated traffic environments in Blender, facilitating targeted model training. Our approach employs a modified Point Voxel-Region-based Convolutional Neural Network (PV-RCNN) for robust 3D detection and PointNet for classifying pedestrian activities, significantly benefiting urban traffic management and public health by offering insights into pedestrian behavior and promoting safer urban environments. Our dual-model approach not only enhances urban traffic management but also contributes significantly to public health by providing insights into pedestrian behavior and promoting safer urban environment.
- [1533] arXiv:2404.10981 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: A Survey on Retrieval-Augmented Text Generation for Large Language ModelsComments: Ongoing workSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements to address the static limitations of large language models (LLMs) by enabling the dynamic integration of up-to-date external information. This methodology, focusing primarily on the text domain, provides a cost-effective solution to the generation of plausible but incorrect responses by LLMs, thereby enhancing the accuracy and reliability of their outputs through the use of real-world data. As RAG grows in complexity and incorporates multiple concepts that can influence its performance, this paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation, offering a detailed perspective from the retrieval viewpoint. It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies. Additionally, the paper introduces evaluation methods for RAG, addressing the challenges faced and proposing future research directions. By offering an organized framework and categorization, the study aims to consolidate existing research on RAG, clarify its technological underpinnings, and highlight its potential to broaden the adaptability and applications of LLMs.
- [1534] arXiv:2404.11008 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AKGNet: Attribute Knowledge-Guided Unsupervised Lung-Infected Area SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Lung-infected area segmentation is crucial for assessing the severity of lung diseases. However, existing image-text multi-modal methods typically rely on labour-intensive annotations for model training, posing challenges regarding time and expertise. To address this issue, we propose a novel attribute knowledge-guided framework for unsupervised lung-infected area segmentation (AKGNet), which achieves segmentation solely based on image-text data without any mask annotation. AKGNet facilitates text attribute knowledge learning, attribute-image cross-attention fusion, and high-confidence-based pseudo-label exploration simultaneously. It can learn statistical information and capture spatial correlations between image and text attributes in the embedding space, iteratively refining the mask to enhance segmentation. Specifically, we introduce a text attribute knowledge learning module by extracting attribute knowledge and incorporating it into feature representations, enabling the model to learn statistical information and adapt to different attributes. Moreover, we devise an attribute-image cross-attention module by calculating the correlation between attributes and images in the embedding space to capture spatial dependency information, thus selectively focusing on relevant regions while filtering irrelevant areas. Finally, a self-training mask improvement process is employed by generating pseudo-labels using high-confidence predictions to iteratively enhance the mask and segmentation. Experimental results on a benchmark medical image dataset demonstrate the superior performance of our method compared to state-of-the-art segmentation techniques in unsupervised scenarios.
- [1535] arXiv:2404.11014 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: Towards Multi-agent Reinforcement Learning based Traffic Signal Control through Spatio-temporal HypergraphsSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI)
Abstract: Traffic signal control systems (TSCSs) are integral to intelligent traffic management, fostering efficient vehicle flow. Traditional approaches often simplify road networks into standard graphs, which results in a failure to consider the dynamic nature of traffic data at neighboring intersections, thereby neglecting higher-order interconnections necessary for real-time control. To address this, we propose a novel TSCS framework to realize intelligent traffic control. This framework collaborates with multiple neighboring edge computing servers to collect traffic information across the road network. To elevate the efficiency of traffic signal control, we have crafted a multi-agent soft actor-critic (MA-SAC) reinforcement learning algorithm. Within this algorithm, individual agents are deployed at each intersection with a mandate to optimize traffic flow across the entire road network collectively. Furthermore, we introduce hypergraph learning into the critic network of MA-SAC to enable the spatio-temporal interactions from multiple intersections in the road network. This method fuses hypergraph and spatio-temporal graph structures to encode traffic data and capture the complex spatial and temporal correlations between multiple intersections. Our empirical evaluation, tested on varied datasets, demonstrates the superiority of our framework in minimizing average vehicle travel times and sustaining high-throughput performance. This work facilitates the development of more intelligent and reactive urban traffic management solutions.
- [1536] arXiv:2404.11015 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FedFa: A Fully Asynchronous Training Paradigm for Federated LearningJournal-ref: IJCAI 2024: the 33rd International Joint Conference on Artificial IntelligenceSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Federated learning has been identified as an efficient decentralized training paradigm for scaling the machine learning model training on a large number of devices while guaranteeing the data privacy of the trainers. FedAvg has become a foundational parameter update strategy for federated learning, which has been promising to eliminate the effect of the heterogeneous data across clients and guarantee convergence. However, the synchronization parameter update barriers for each communication round during the training significant time on waiting, slowing down the training procedure. Therefore, recent state-of-the-art solutions propose using semi-asynchronous approaches to mitigate the waiting time cost with guaranteed convergence. Nevertheless, emerging semi-asynchronous approaches are unable to eliminate the waiting time completely.
We propose a full asynchronous training paradigm, called FedFa, which can guarantee model convergence and eliminate the waiting time completely for federated learning by using a few buffered results on the server for parameter updating. Further, we provide theoretical proof of the convergence rate for our proposed FedFa. Extensive experimental results indicate our approach effectively improves the training performance of federated learning by up to 6x and 4x speedup compared to the state-of-the-art synchronous and semi-asynchronous strategies while retaining high accuracy in both IID and Non-IID scenarios. - [1537] arXiv:2404.11016 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MaeFuse: Transferring Omni Features with Pretrained Masked Autoencoders for Infrared and Visible Image Fusion via Guided TrainingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this research, we introduce MaeFuse, a novel autoencoder model designed for infrared and visible image fusion (IVIF). The existing approaches for image fusion often rely on training combined with downstream tasks to obtain high-level visual information, which is effective in emphasizing target objects and delivering impressive results in visual quality and task-specific applications. MaeFuse, however, deviates from the norm. Instead of being driven by downstream tasks, our model utilizes a pretrained encoder from Masked Autoencoders (MAE), which facilities the omni features extraction for low-level reconstruction and high-level vision tasks, to obtain perception friendly features with a low cost. In order to eliminate the domain gap of different modal features and the block effect caused by the MAE encoder, we further develop a guided training strategy. This strategy is meticulously crafted to ensure that the fusion layer seamlessly adjusts to the feature space of the encoder, gradually enhancing the fusion effect. It facilitates the comprehensive integration of feature vectors from both infrared and visible modalities, preserving the rich details inherent in each. MaeFuse not only introduces a novel perspective in the realm of fusion techniques but also stands out with impressive performance across various public datasets.
- [1538] arXiv:2404.11018 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Many-Shot In-Context LearningRishabh Agarwal , Avi Singh , Lei M. Zhang , Bernd Bohnet , Stephanie Chan , Ankesh Anand , Zaheer Abbas , Azade Nova , John D. Co-Reyes , Eric Chu , Feryal Behbahani , Aleksandra Faust , Hugo LarochelleSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, and prompts the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases and can learn high-dimensional functions with numerical inputs. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.
- [1539] arXiv:2404.11049 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Stepwise Alignment for Constrained Language Model Policy OptimizationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Safety and trustworthiness are indispensable requirements for applying AI systems based on large language models (LLMs) in real-world applications. This paper formulates a human value alignment as a language model policy optimization problem to maximize reward under a safety constraint and then proposes an algorithm called Stepwise Alignment for Constrained Policy Optimization (SACPO). A key idea behind SACPO, supported by theory, is that the optimal policy incorporating both reward and safety can be directly obtained from a reward-aligned policy. Based on this key idea, SACPO aligns the LLMs with each metric step-wise while leveraging simple yet powerful alignment algorithms such as direct preference optimization (DPO). SACPO provides many benefits such as simplicity, stability, computational efficiency, and flexibility regarding algorithms and dataset selection. Under mild assumption, our theoretical analysis provides the upper bounds regarding near-optimality and safety constraint violation. Our experimental results show that SACPO can fine-tune Alpaca-7B better than the state-of-the-art method in terms of both helpfulness and harmlessness
- [1540] arXiv:2404.11056 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: LMEraser: Large Model Unlearning through Adaptive Prompt TuningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: To address the growing demand for privacy protection in machine learning, we propose a novel and efficient machine unlearning approach for \textbf{L}arge \textbf{M}odels, called \textbf{LM}Eraser. Existing unlearning research suffers from entangled training data and complex model architectures, incurring extremely high computational costs for large models. LMEraser takes a divide-and-conquer strategy with a prompt tuning architecture to isolate data influence. The training dataset is partitioned into public and private datasets. Public data are used to train the backbone of the model. Private data are adaptively clustered based on their diversity, and each cluster is used to optimize a prompt separately. This adaptive prompt tuning mechanism reduces unlearning costs and maintains model performance. Experiments demonstrate that LMEraser achieves a $100$-fold reduction in unlearning costs without compromising accuracy compared to prior work. Our code is available at: \url{ this https URL }.
- [1541] arXiv:2404.11064 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based LocalizationYongdong Luo , Haojia Lin , Xiawu Zheng , Yigeng Jiang , Fei Chao , Jie Hu , Guannan Jiang , Songan Zhang , Rongrong JiSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: 3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications, which require both shared and complementary information in localization and visual-language relationships. Therefore, existing approaches adopt the two-stage "detect-then-describe/discriminate" pipeline, which relies heavily on the performance of the detector, resulting in suboptimal performance. Inspired by DETR, we propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks in an end-to-end fashion. The key idea is to reconsider the prompt-based localization ability of the 3DVG model. In this way, the 3DVG model with a well-designed prompt as input can assist the 3DDC task by extracting localization information from the prompt. In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection, effectively harnessing the existing 3DVG model's inherent localization capacity, thereby boosting 3DDC capability. This integration facilitates simultaneous multi-task training on both tasks, mutually enhancing their performance. Extensive experimental results demonstrate the effectiveness of this approach. Specifically, on the ScanRefer dataset, 3DGCTR surpasses the state-of-the-art 3DDC method by 4.3% in CIDEr@0.5IoU in MLE training and improves upon the SOTA 3DVG method by 3.16% in Acc@0.25IoU.
- [1542] arXiv:2404.11068 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: ScaleFold: Reducing AlphaFold Initial Training Time to 10 HoursFeiwen Zhu , Arkadiusz Nowaczynski , Rundong Li , Jie Xin , Yifei Song , Michal Marcinkiewicz , Sukru Burc Eryilmaz , Jun Yang , Michael AnderschSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Quantitative Methods (q-bio.QM)
Abstract: AlphaFold2 has been hailed as a breakthrough in protein folding. It can rapidly predict protein structures with lab-grade accuracy. However, its implementation does not include the necessary training code. OpenFold is the first trainable public reimplementation of AlphaFold. AlphaFold training procedure is prohibitively time-consuming, and gets diminishing benefits from scaling to more compute resources. In this work, we conducted a comprehensive analysis on the AlphaFold training procedure based on Openfold, identified that inefficient communications and overhead-dominated computations were the key factors that prevented the AlphaFold training from effective scaling. We introduced ScaleFold, a systematic training method that incorporated optimizations specifically for these factors. ScaleFold successfully scaled the AlphaFold training to 2080 NVIDIA H100 GPUs with high resource utilization. In the MLPerf HPC v3.0 benchmark, ScaleFold finished the OpenFold benchmark in 7.51 minutes, shown over $6\times$ speedup than the baseline. For training the AlphaFold model from scratch, ScaleFold completed the pretraining in 10 hours, a significant improvement over the seven days required by the original AlphaFold pretraining baseline.
- [1543] arXiv:2404.11072 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Large Language Models Meet User Interfaces: The Case of Provisioning FeedbackStanislav Pozdniakov , Jonathan Brazil , Solmaz Abdi , Aneesha Bakharia , Shazia Sadiq , Dragan Gasevic , Paul Denny , Hassan KhosraviComments: submission to C&E AISubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Incorporating Generative AI (GenAI) and Large Language Models (LLMs) in education can enhance teaching efficiency and enrich student learning. Current LLM usage involves conversational user interfaces (CUIs) for tasks like generating materials or providing feedback. However, this presents challenges including the need for educator expertise in AI and CUIs, ethical concerns with high-stakes decisions, and privacy risks. CUIs also struggle with complex tasks. To address these, we propose transitioning from CUIs to user-friendly applications leveraging LLMs via API calls. We present a framework for ethically incorporating GenAI into educational tools and demonstrate its application in our tool, Feedback Copilot, which provides personalized feedback on student assignments. Our evaluation shows the effectiveness of this approach, with implications for GenAI researchers, educators, and technologists. This work charts a course for the future of GenAI in education.
- [1544] arXiv:2404.11075 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: EEG_GLT-Net: Optimising EEG Graphs for Real-time Motor Imagery Signals ClassificationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: Brain-Computer Interfaces connect the brain to external control devices, necessitating the accurate translation of brain signals such as from electroencephalography (EEG) into executable commands. Graph Neural Networks (GCN) have been increasingly applied for classifying EEG Motor Imagery signals, primarily because they incorporates the spatial relationships among EEG channels, resulting in improved accuracy over traditional convolutional methods. Recent advances by GCNs-Net in real-time EEG MI signal classification utilised Pearson Coefficient Correlation (PCC) for constructing adjacency matrices, yielding significant results on the PhysioNet dataset. Our paper introduces the EEG Graph Lottery Ticket (EEG_GLT) algorithm, an innovative technique for constructing adjacency matrices for EEG channels. It does not require pre-existing knowledge of inter-channel relationships, and it can be tailored to suit both individual subjects and GCN model architectures. Our findings demonstrated that the PCC method outperformed the Geodesic approach by 9.65% in mean accuracy, while our EEG_GLT matrix consistently exceeded the performance of the PCC method by a mean accuracy of 13.39%. Also, we found that the construction of the adjacency matrix significantly influenced accuracy, to a greater extent than GCN model configurations. A basic GCN configuration utilising our EEG_GLT matrix exceeded the performance of even the most complex GCN setup with a PCC matrix in average accuracy. Our EEG_GLT method also reduced MACs by up to 97% compared to the PCC method, while maintaining or enhancing accuracy. In conclusion, the EEG_GLT algorithm marks a breakthrough in the development of optimal adjacency matrices, effectively boosting both computational accuracy and efficiency, making it well-suited for real-time classification of EEG MI signals that demand intensive computational resources.
- [1545] arXiv:2404.11086 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language ModelsComments: arXiv admin note: text overlap with arXiv:2305.08322 by other authorsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The rapid advancement of large language models (LLMs) necessitates the development of new benchmarks to accurately assess their capabilities. To address this need for Vietnamese, this work aims to introduce ViLLM-Eval, the comprehensive evaluation suite designed to measure the advanced knowledge and reasoning abilities of foundation models within a Vietnamese context. ViLLM-Eval consists of multiple-choice questions and predict next word tasks spanning various difficulty levels and diverse disciplines, ranging from humanities to science and engineering. A thorough evaluation of the most advanced LLMs on ViLLM-Eval revealed that even the best performing models have significant room for improvement in understanding and responding to Vietnamese language tasks. ViLLM-Eval is believed to be instrumental in identifying key strengths and weaknesses of foundation models, ultimately promoting their development and enhancing their performance for Vietnamese users. This paper provides a thorough overview of ViLLM-Eval as part of the Vietnamese Large Language Model shared task, held within the 10th International Workshop on Vietnamese Language and Speech Processing (VLSP 2023).
- [1546] arXiv:2404.11095 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Inductive-Deductive Strategy Reuse for Multi-Turn Instructional DialoguesComments: 27 pages, 3 figures, 12 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Aligning large language models (LLMs) with human expectations requires high-quality instructional dialogues, which can be achieved by raising diverse, in-depth, and insightful instructions that deepen interactions. Existing methods target instructions from real instruction dialogues as a learning goal and fine-tune a user simulator for posing instructions. However, the user simulator struggles to implicitly model complex dialogue flows and pose high-quality instructions. In this paper, we take inspiration from the cognitive abilities inherent in human learning and propose the explicit modeling of complex dialogue flows through instructional strategy reuse. Specifically, we first induce high-level strategies from various real instruction dialogues. These strategies are applied to new dialogue scenarios deductively, where the instructional strategies facilitate high-quality instructions. Experimental results show that our method can generate diverse, in-depth, and insightful instructions for a given dialogue history. The constructed multi-turn instructional dialogues can outperform competitive baselines on the downstream chat model.
- [1547] arXiv:2404.11116 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Music Enhancement with Deep Filters: A Technical Report for The ICASSP 2024 Cadenza ChallengeComments: 2 pages, 2 figures, 1 tables, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Abstract: In this challenge, we disentangle the deep filters from the original DeepfilterNet and incorporate them into our Spec-UNet-based network to further improve a hybrid Demucs (hdemucs) based remixing pipeline. The motivation behind the use of the deep filter component lies at its potential in better handling temporal fine structures. We demonstrate an incremental improvement in both the Signal-to-Distortion Ratio (SDR) and the Hearing Aid Audio Quality Index (HAAQI) metrics when comparing the performance of hdemucs against different versions of our model.
- [1548] arXiv:2404.11121 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge DeploymentComments: arXiv admin note: text overlap with arXiv:2310.07152 by other authorsSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Proprietary large language models (LLMs) have been widely applied in various scenarios. Additionally, deploying LLMs on edge devices is trending for efficiency and privacy reasons. However, edge deployment of proprietary LLMs introduces new security challenges: edge-deployed models are exposed as white-box accessible to users, enabling adversaries to conduct effective model stealing (MS) attacks. Unfortunately, existing defense mechanisms fail to provide effective protection. Specifically, we identify four critical protection properties that existing methods fail to simultaneously satisfy: (1) maintaining protection after a model is physically copied; (2) authorizing model access at request level; (3) safeguarding runtime reverse engineering; (4) achieving high security with negligible runtime overhead. To address the above issues, we propose TransLinkGuard, a plug-and-play model protection approach against model stealing on edge devices. The core part of TransLinkGuard is a lightweight authorization module residing in a secure environment, e.g., TEE. The authorization module can freshly authorize each request based on its input. Extensive experiments show that TransLinkGuard achieves the same security protection as the black-box security guarantees with negligible overhead.
- [1549] arXiv:2404.11124 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: What's under the hood: Investigating Automatic Metrics on Meeting SummarizationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Meeting summarization has become a critical task considering the increase in online interactions. While new techniques are introduced regularly, their evaluation uses metrics not designed to capture meeting-specific errors, undermining effective evaluation. This paper investigates what the frequently used automatic metrics capture and which errors they mask by correlating automatic metric scores with human evaluations across a broad error taxonomy. We commence with a comprehensive literature review on English meeting summarization to define key challenges like speaker dynamics and contextual turn-taking and error types such as missing information and linguistic inaccuracy, concepts previously loosely defined in the field. We examine the relationship between characteristic challenges and errors by using annotated transcripts and summaries from Transformer-based sequence-to-sequence and autoregressive models from the general summary QMSum dataset. Through experimental validation, we find that different model architectures respond variably to challenges in meeting transcripts, resulting in different pronounced links between challenges and errors. Current default-used metrics struggle to capture observable errors, showing weak to mid-correlations, while a third of the correlations show trends of error masking. Only a subset reacts accurately to specific errors, while most correlations show either unresponsiveness or failure to reflect the error's impact on summary quality.
- [1550] arXiv:2404.11171 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Personalized Heart Disease Detection via ECG Digital Twin GenerationYaojun Hu , Jintai Chen , Lianting Hu , Dantong Li , Jiahuan Yan , Haochao Ying , Huiying Liang , Jian WuSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: Heart diseases rank among the leading causes of global mortality, demonstrating a crucial need for early diagnosis and intervention. Most traditional electrocardiogram (ECG) based automated diagnosis methods are trained at population level, neglecting the customization of personalized ECGs to enhance individual healthcare management. A potential solution to address this limitation is to employ digital twins to simulate symptoms of diseases in real patients. In this paper, we present an innovative prospective learning approach for personalized heart disease detection, which generates digital twins of healthy individuals' anomalous ECGs and enhances the model sensitivity to the personalized symptoms. In our approach, a vector quantized feature separator is proposed to locate and isolate the disease symptom and normal segments in ECG signals with ECG report guidance. Thus, the ECG digital twins can simulate specific heart diseases used to train a personalized heart disease detection model. Experiments demonstrate that our approach not only excels in generating high-fidelity ECG signals but also improves personalized heart disease detection. Moreover, our approach ensures robust privacy protection, safeguarding patient data in model development.
- [1551] arXiv:2404.11172 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Deep Neural Networks via Complex Network Theory: a PerspectiveComments: IJCAI'24 (full paper, main track)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Deep Neural Networks (DNNs) can be represented as graphs whose links and vertices iteratively process data and solve tasks sub-optimally. Complex Network Theory (CNT), merging statistical physics with graph theory, provides a method for interpreting neural networks by analysing their weights and neuron structures. However, classic works adapt CNT metrics that only permit a topological analysis as they do not account for the effect of the input data. In addition, CNT metrics have been applied to a limited range of architectures, mainly including Fully Connected neural networks. In this work, we extend the existing CNT metrics with measures that sample from the DNNs' training distribution, shifting from a purely topological analysis to one that connects with the interpretability of deep learning. For the novel metrics, in addition to the existing ones, we provide a mathematical formalisation for Fully Connected, AutoEncoder, Convolutional and Recurrent neural networks, of which we vary the activation functions and the number of hidden layers. We show that these metrics differentiate DNNs based on the architecture, the number of hidden layers, and the activation function. Our contribution provides a method rooted in physics for interpreting DNNs that offers insights beyond the traditional input-output relationship and the CNT topological analysis.
- [1552] arXiv:2404.11181 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: KI-GAN: Knowledge-Informed Generative Adversarial Networks for Enhanced Multi-Vehicle Trajectory Forecasting at Signalized IntersectionsComments: 2024 CVPR AICity WorkshopSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Reliable prediction of vehicle trajectories at signalized intersections is crucial to urban traffic management and autonomous driving systems. However, it presents unique challenges, due to the complex roadway layout at intersections, involvement of traffic signal controls, and interactions among different types of road users. To address these issues, we present in this paper a novel model called Knowledge-Informed Generative Adversarial Network (KI-GAN), which integrates both traffic signal information and multi-vehicle interactions to predict vehicle trajectories accurately. Additionally, we propose a specialized attention pooling method that accounts for vehicle orientation and proximity at intersections. Based on the SinD dataset, our KI-GAN model is able to achieve an Average Displacement Error (ADE) of 0.05 and a Final Displacement Error (FDE) of 0.12 for a 6-second observation and 6-second prediction cycle. When the prediction window is extended to 9 seconds, the ADE and FDE values are further reduced to 0.11 and 0.26, respectively. These results demonstrate the effectiveness of the proposed KI-GAN model in vehicle trajectory prediction under complex scenarios at signalized intersections, which represents a significant advancement in the target field.
- [1553] arXiv:2404.11207 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Exploring the Transferability of Visual Prompting for Multimodal Large Language ModelsComments: Accepted in CVPR 2024 as Poster (Highlight)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated promising versatile capabilities, their performance is still inferior to specialized models on downstream tasks, which makes adaptation necessary to enhance their utility. However, fine-tuning methods require independent training for every model, leading to huge computation and memory overheads. In this paper, we propose a novel setting where we aim to improve the performance of diverse MLLMs with a group of shared parameters optimized for a downstream task. To achieve this, we propose Transferable Visual Prompting (TVP), a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after trained on only one model. We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts, including 1) Feature Consistency Alignment: which imposes constraints to the prompted feature changes to maintain task-agnostic knowledge; 2) Task Semantics Enrichment: which encourages the prompted images to contain richer task-specific semantics with language guidance. We validate the effectiveness of TVP through extensive experiments with 6 modern MLLMs on a wide variety of tasks ranging from object recognition and counting to multimodal reasoning and hallucination correction.
- [1554] arXiv:2404.11213 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Revisiting Noise Resilience Strategies in Gesture Recognition: Short-Term Enhancement in Surface Electromyographic Signal AnalysisSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI)
Abstract: Gesture recognition based on surface electromyography (sEMG) has been gaining importance in many 3D Interactive Scenes. However, sEMG is easily influenced by various forms of noise in real-world environments, leading to challenges in providing long-term stable interactions through sEMG. Existing methods often struggle to enhance model noise resilience through various predefined data augmentation techniques. In this work, we revisit the problem from a short term enhancement perspective to improve precision and robustness against various common noisy scenarios with learnable denoise using sEMG intrinsic pattern information and sliding-window attention. We propose a Short Term Enhancement Module(STEM) which can be easily integrated with various models. STEM offers several benefits: 1) Learnable denoise, enabling noise reduction without manual data augmentation; 2) Scalability, adaptable to various models; and 3) Cost-effectiveness, achieving short-term enhancement through minimal weight-sharing in an efficient attention mechanism. In particular, we incorporate STEM into a transformer, creating the Short Term Enhanced Transformer (STET). Compared with best-competing approaches, the impact of noise on STET is reduced by more than 20%. We also report promising results on both classification and regression datasets and demonstrate that STEM generalizes across different gesture recognition tasks.
- [1555] arXiv:2404.11214 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Feature Corrective Transfer Learning: End-to-End Solutions to Object Detection in Non-Ideal Visual ConditionsComments: 2024 CVPR UG2+ WorkshopSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: A significant challenge in the field of object detection lies in the system's performance under non-ideal imaging conditions, such as rain, fog, low illumination, or raw Bayer images that lack ISP processing. Our study introduces "Feature Corrective Transfer Learning", a novel approach that leverages transfer learning and a bespoke loss function to facilitate the end-to-end detection of objects in these challenging scenarios without the need to convert non-ideal images into their RGB counterparts. In our methodology, we initially train a comprehensive model on a pristine RGB image dataset. Subsequently, non-ideal images are processed by comparing their feature maps against those from the initial ideal RGB model. This comparison employs the Extended Area Novel Structural Discrepancy Loss (EANSDL), a novel loss function designed to quantify similarities and integrate them into the detection loss. This approach refines the model's ability to perform object detection across varying conditions through direct feature map correction, encapsulating the essence of Feature Corrective Transfer Learning. Experimental validation on variants of the KITTI dataset demonstrates a significant improvement in mean Average Precision (mAP), resulting in a 3.8-8.1% relative enhancement in detection under non-ideal conditions compared to the baseline model, and a less marginal performance difference within 1.3% of the mAP@[0.5:0.95] achieved under ideal conditions by the standard Faster RCNN algorithm.
- [1556] arXiv:2404.11216 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Position Engineering: Boosting Large Language Models through Positional Information ManipulationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The performance of large language models (LLMs) is significantly influenced by the quality of the prompts provided. In response, researchers have developed enormous prompt engineering strategies aimed at modifying the prompt text to enhance task performance. In this paper, we introduce a novel technique termed position engineering, which offers a more efficient way to guide large language models. Unlike prompt engineering, which requires substantial effort to modify the text provided to LLMs, position engineering merely involves altering the positional information in the prompt without modifying the text itself. We have evaluated position engineering in two widely-used LLM scenarios: retrieval-augmented generation (RAG) and in-context learning (ICL). Our findings show that position engineering substantially improves upon the baseline in both cases. Position engineering thus represents a promising new strategy for exploiting the capabilities of large language models.
- [1557] arXiv:2404.11225 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: In-Context Learning State Vector with Inner and Momentum OptimizationComments: 17 pages, 7 figures, 5 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have exhibited an impressive ability to perform In-Context Learning (ICL) from only a few examples. Recent works have indicated that the functions learned by ICL can be represented through compressed vectors derived from the transformer. However, the working mechanisms and optimization of these vectors are yet to be thoroughly explored. In this paper, we address this gap by presenting a comprehensive analysis of these compressed vectors, drawing parallels to the parameters trained with gradient descent, and introduce the concept of state vector. Inspired by the works on model soup and momentum-based gradient descent, we propose inner and momentum optimization methods that are applied to refine the state vector progressively as test-time adaptation. Moreover, we simulate state vector aggregation in the multiple example setting, where demonstrations comprising numerous examples are usually too lengthy for regular ICL, and further propose a divide-and-conquer aggregation method to address this challenge. We conduct extensive experiments using Llama-2 and GPT-J in both zero-shot setting and few-shot setting. The experimental results show that our optimization method effectively enhances the state vector and achieves the state-of-the-art performance on diverse tasks. Code is available at this https URL
- [1558] arXiv:2404.11230 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Energy-Efficient Uncertainty-Aware Biomass Composition Prediction at the EdgeComments: The paper has been accepted to CVPR 2024 5th Workshop on Vision for AgricultureSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Clover fixates nitrogen from the atmosphere to the ground, making grass-clover mixtures highly desirable to reduce external nitrogen fertilization. Herbage containing clover additionally promotes higher food intake, resulting in higher milk production. Herbage probing however remains largely unused as it requires a time-intensive manual laboratory analysis. Without this information, farmers are unable to perform localized clover sowing or take targeted fertilization decisions. Deep learning algorithms have been proposed with the goal to estimate the dry biomass composition from images of the grass directly in the fields. The energy-intensive nature of deep learning however limits deployment to practical edge devices such as smartphones. This paper proposes to fill this gap by applying filter pruning to reduce the energy requirement of existing deep learning solutions. We report that although pruned networks are accurate on controlled, high-quality images of the grass, they struggle to generalize to real-world smartphone images that are blurry or taken from challenging angles. We address this challenge by training filter-pruned models using a variance attenuation loss so they can predict the uncertainty of their predictions. When the uncertainty exceeds a threshold, we re-infer using a more accurate unpruned model. This hybrid approach allows us to reduce energy consumption while retaining a high accuracy. We evaluate our algorithm on two datasets: the GrassClover and the Irish clover using an NVIDIA Jetson Nano edge device. We find that we reduce energy reduction with respect to state-of-the-art solutions by 50% on average with only 4% accuracy loss.
- [1559] arXiv:2404.11243 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Optical Image-to-Image Translation Using Denoising Diffusion Models: Heterogeneous Change Detection as a Use CaseSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We introduce an innovative deep learning-based method that uses a denoising diffusion-based model to translate low-resolution images to high-resolution ones from different optical sensors while preserving the contents and avoiding undesired artifacts. The proposed method is trained and tested on a large and diverse data set of paired Sentinel-II and Planet Dove images. We show that it can solve serious image generation issues observed when the popular classifier-free guided Denoising Diffusion Implicit Model (DDIM) framework is used in the task of Image-to-Image Translation of multi-sensor optical remote sensing images and that it can generate large images with highly consistent patches, both in colors and in features. Moreover, we demonstrate how our method improves heterogeneous change detection results in two urban areas: Beirut, Lebanon, and Austin, USA. Our contributions are: i) a new training and testing algorithm based on denoising diffusion models for optical image translation; ii) a comprehensive image quality evaluation and ablation study; iii) a comparison with the classifier-free guided DDIM framework; and iv) change detection experiments on heterogeneous data.
- [1560] arXiv:2404.11269 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: DACAD: Domain Adaptation Contrastive Learning for Anomaly Detection in Multivariate Time SeriesComments: 11 pages, 2 figures, 5 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Time series anomaly detection (TAD) faces a significant challenge due to the scarcity of labelled data, which hinders the development of accurate detection models. Unsupervised domain adaptation (UDA) addresses this challenge by leveraging a labelled dataset from a related domain to detect anomalies in a target dataset. Existing domain adaptation techniques assume that the number of anomalous classes does not change between the source and target domains. In this paper, we propose a novel Domain Adaptation Contrastive learning for Anomaly Detection in multivariate time series (DACAD) model to address this issue by combining UDA and contrastive representation learning. DACAD's approach includes an anomaly injection mechanism that introduces various types of synthetic anomalies, enhancing the model's ability to generalise across unseen anomalous classes in different domains. This method significantly broadens the model's adaptability and robustness. Additionally, we propose a supervised contrastive loss for the source domain and a self-supervised contrastive triplet loss for the target domain, improving comprehensive feature representation learning and extraction of domain-invariant features. Finally, an effective Centre-based Entropy Classifier (CEC) is proposed specifically for anomaly detection, facilitating accurate learning of normal boundaries in the source domain. Our extensive evaluation across multiple real-world datasets against leading models in time series anomaly detection and UDA underscores DACAD's effectiveness. The results validate DACAD's superiority in transferring knowledge across domains and its potential to mitigate the challenge of limited labelled data in time series anomaly detection.
- [1561] arXiv:2404.11280 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: Image Generative Semantic Communication with Multi-Modal Similarity Estimation for Resource-Limited NetworksComments: 13 pages, 15 figures, this paper has been submitted to IEICE Transactions on CommunicationsSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI)
Abstract: To reduce network traffic and support environments with limited resources, a method for transmitting images with low amounts of transmission data is required. Machine learning-based image compression methods, which compress the data size of images while maintaining their features, have been proposed. However, in certain situations, reconstructing a part of semantic information of images at the receiver end may be sufficient. To realize this concept, semantic-information-based communication, called semantic communication, has been proposed, along with an image transmission method using semantic communication. This method transmits only the semantic information of an image, and the receiver reconstructs the image using an image-generation model. This method utilizes one type of semantic information, but reconstructing images similar to the original image using only it is challenging. This study proposes a multi-modal image transmission method that leverages diverse semantic information for efficient semantic communication. The proposed method extracts multi-modal semantic information from an image and transmits only it. Subsequently, the receiver generates multiple images using an image-generation model and selects an output based on semantic similarity. The receiver must select the output based only on the received features; however, evaluating semantic similarity using conventional metrics is challenging. Therefore, this study explored new metrics to evaluate the similarity between semantic features of images and proposes two scoring procedures. The results indicate that the proposed procedures can compare semantic similarities, such as position and composition, between semantic features of the original and generated images. Thus, the proposed method can facilitate the transmission and utilization of photographs through mobile networks for various service applications.
- [1562] arXiv:2404.11313 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and ResultsXin Li , Kun Yuan , Yajing Pei , Yiting Lu , Ming Sun , Chao Zhou , Zhibo Chen , Radu Timofte , Wei Sun , Haoning Wu , Zicheng Zhang , Jun Jia , Zhichao Zhang , Linhan Cao , Qiubo Chen , Xiongkuo Min , Weisi Lin , Guangtao Zhai , Jianhui Sun , Tianyi Wang , Lei Li , Han Kong , Wenxuan Wang , Bing Li , Cheng Luo , Haiqiang Wang , Xiangguang Chen , Wenhui Meng , Xiang Pan , Huiying Shi , Han Zhu , Xiaozhong Xu , Lei Sun , Zhenzhong Chen , Shan Liu , Fangyuan Kong , Haotian Fan , Yifang Xu , Haoran Xu , Mengduo Yang , Jie Zhou , Jiaze Li , Shijie Wen , Mai Xu , Da Li , Shunyu Yao , Jiazhi Du , Wangmeng Zuo , Zhibo Li , Shuai He , Anlong Ming , Huiyuan Fu , Huadong Ma , Yong Wu , Fie Xue , Guozhi Zhao , Lina Du , Jie Guo , Yu Zhang , Huimin Zheng , Junhao Chen , Yue Liu , Dulan Zhou , Kele Xu , Qisheng Xu , Tao Sun , Zhixiang Ding , Yuhang HuComments: Accepted by CVPR2024 Workshop. The challenge report for CVPR NTIRE2024 Short-form UGC Video Quality Assessment ChallengeSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI)
Abstract: This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performances for S-UGC VQA. The project can be found at this https URL .
- [1563] arXiv:2404.11317 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and NegativesComments: 12 pages, 11 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The Composed Image Retrieval (CIR) task aims to retrieve target images using a composed query consisting of a reference image and a modified text. Advanced methods often utilize contrastive learning as the optimization objective, which benefits from adequate positive and negative examples. However, the triplet for CIR incurs high manual annotation costs, resulting in limited positive examples. Furthermore, existing methods commonly use in-batch negative sampling, which reduces the negative number available for the model. To address the problem of lack of positives, we propose a data generation method by leveraging a multi-modal large language model to construct triplets for CIR. To introduce more negatives during fine-tuning, we design a two-stage fine-tuning framework for CIR, whose second stage introduces plenty of static representations of negatives to optimize the representation space rapidly. The above two improvements can be effectively stacked and designed to be plug-and-play, easily applied to existing CIR models without changing their original architectures. Extensive experiments and ablation analysis demonstrate that our method effectively scales positives and negatives and achieves state-of-the-art results on both FashionIQ and CIRR datasets. In addition, our methods also perform well in zero-shot composed image retrieval, providing a new CIR solution for the low-resources scenario.
- [1564] arXiv:2404.11335 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SoccerNet Game State Reconstruction: End-to-End Athlete Tracking and Identification on a MinimapVladimir Somers , Victor Joos , Anthony Cioppa , Silvio Giancola , Seyed Abolfazl Ghasemzadeh , Floriane Magera , Baptiste Standaert , Amir Mohammad Mansourian , Xin Zhou , Shohreh Kasaei , Bernard Ghanem , Alexandre Alahi , Marc Van Droogenbroeck , Christophe De VleeschouwerSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Tracking and identifying athletes on the pitch holds a central role in collecting essential insights from the game, such as estimating the total distance covered by players or understanding team tactics. This tracking and identification process is crucial for reconstructing the game state, defined by the athletes' positions and identities on a 2D top-view of the pitch, (i.e. a minimap). However, reconstructing the game state from videos captured by a single camera is challenging. It requires understanding the position of the athletes and the viewpoint of the camera to localize and identify players within the field. In this work, we formalize the task of Game State Reconstruction and introduce SoccerNet-GSR, a novel Game State Reconstruction dataset focusing on football videos. SoccerNet-GSR is composed of 200 video sequences of 30 seconds, annotated with 9.37 million line points for pitch localization and camera calibration, as well as over 2.36 million athlete positions on the pitch with their respective role, team, and jersey number. Furthermore, we introduce GS-HOTA, a novel metric to evaluate game state reconstruction methods. Finally, we propose and release an end-to-end baseline for game state reconstruction, bootstrapping the research on this task. Our experiments show that GSR is a challenging novel task, which opens the field for future research. Our dataset and codebase are publicly available at this https URL .
- [1565] arXiv:2404.11343 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Large Language Models meet Collaborative Filtering: An Efficient All-round LLM-based Recommender SystemComments: Under reviewSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Collaborative filtering recommender systems (CF-RecSys) have shown successive results in enhancing the user experience on social media and e-commerce platforms. However, as CF-RecSys struggles under cold scenarios with sparse user-item interactions, recent strategies have focused on leveraging modality information of user/items (e.g., text or images) based on pre-trained modality encoders and Large Language Models (LLMs). Despite their effectiveness under cold scenarios, we observe that they underperform simple traditional collaborative filtering models under warm scenarios due to the lack of collaborative knowledge. In this work, we propose an efficient All-round LLM-based Recommender system, called A-LLMRec, that excels not only in the cold scenario but also in the warm scenario. Our main idea is to enable an LLM to directly leverage the collaborative knowledge contained in a pre-trained state-of-the-art CF-RecSys so that the emergent ability of the LLM as well as the high-quality user/item embeddings that are already trained by the state-of-the-art CF-RecSys can be jointly exploited. This approach yields two advantages: (1) model-agnostic, allowing for integration with various existing CF-RecSys, and (2) efficiency, eliminating the extensive fine-tuning typically required for LLM-based recommenders. Our extensive experiments on various real-world datasets demonstrate the superiority of A-LLMRec in various scenarios, including cold/warm, few-shot, cold user, and cross-domain scenarios. Beyond the recommendation task, we also show the potential of A-LLMRec in generating natural language outputs based on the understanding of the collaborative knowledge by performing a favorite genre prediction task. Our code is available at this https URL .
- [1566] arXiv:2404.11350 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Calibrating Bayesian Learning via Regularization, Confidence Minimization, and Selective InferenceComments: Under reviewSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: The application of artificial intelligence (AI) models in fields such as engineering is limited by the known difficulty of quantifying the reliability of an AI's decision. A well-calibrated AI model must correctly report its accuracy on in-distribution (ID) inputs, while also enabling the detection of out-of-distribution (OOD) inputs. A conventional approach to improve calibration is the application of Bayesian ensembling. However, owing to computational limitations and model misspecification, practical ensembling strategies do not necessarily enhance calibration. This paper proposes an extension of variational inference (VI)-based Bayesian learning that integrates calibration regularization for improved ID performance, confidence minimization for OOD detection, and selective calibration to ensure a synergistic use of calibration regularization and confidence minimization. The scheme is constructed successively by first introducing calibration-regularized Bayesian learning (CBNN), then incorporating out-of-distribution confidence minimization (OCM) to yield CBNN-OCM, and finally integrating also selective calibration to produce selective CBNN-OCM (SCBNN-OCM). Selective calibration rejects inputs for which the calibration performance is expected to be insufficient. Numerical results illustrate the trade-offs between ID accuracy, ID calibration, and OOD calibration attained by both frequentist and Bayesian learning methods. Among the main conclusions, SCBNN-OCM is seen to achieve best ID and OOD performance as compared to existing state-of-the-art approaches at the cost of rejecting a sufficiently large number of inputs.
- [1567] arXiv:2404.11370 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Characterizing and modeling harms from interactions with design patterns in AI interfacesComments: Fixed issue with referencesSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: The proliferation of applications using artificial intelligence (AI) systems has led to a growing number of users interacting with these systems through sophisticated interfaces. Human-computer interaction research has long shown that interfaces shape both user behavior and user perception of technical capabilities and risks. Yet, practitioners and researchers evaluating the social and ethical risks of AI systems tend to overlook the impact of anthropomorphic, deceptive, and immersive interfaces on human-AI interactions. Here, we argue that design features of interfaces with adaptive AI systems can have cascading impacts, driven by feedback loops, which extend beyond those previously considered. We first conduct a scoping review of AI interface designs and their negative impact to extract salient themes of potentially harmful design patterns in AI interfaces. Then, we propose Design-Enhanced Control of AI systems (DECAI), a conceptual model to structure and facilitate impact assessments of AI interface designs. DECAI draws on principles from control systems theory -- a theory for the analysis and design of dynamic physical systems -- to dissect the role of the interface in human-AI systems. Through two case studies on recommendation systems and conversational language model systems, we show how DECAI can be used to evaluate AI interface designs.
- [1568] arXiv:2404.11422 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Short-term wind speed forecasting model based on an attention-gated recurrent neural network and error correction strategyComments: 23 pages, 11 figures, 6 tables, Technical ReportSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Abstract: The accurate wind speed series forecast is very pivotal to security of grid dispatching and the application of wind power. Nevertheless, on account of their nonlinear and non-stationary nature, their short-term forecast is extremely challenging. Therefore, this dissertation raises one short-term wind speed forecast pattern on the foundation of attention with an improved gated recurrent neural network (AtGRU) and a tactic of error correction. That model uses the AtGRU model as the preliminary predictor and the GRU model as the error corrector. At the beginning, SSA (singular spectrum analysis) is employed in previous wind speed series for lessening the noise. Subsequently, historical wind speed series is going to be used for the predictor training. During this process, the prediction can have certain errors. The sequence of these errors processed by variational modal decomposition (VMD) is used to train the corrector of error. The eventual forecast consequence is just the sum of predictor forecast and error corrector. The proposed SSA-AtGRU-VMD-GRU model outperforms the compared models in three case studies on Woodburn, St. Thomas, and Santa Cruz. It is indicated that the model evidently enhances the correction of the wind speed forecast.
- [1569] arXiv:2404.11446 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Open-Ended Wargames with Large Language ModelsComments: 15 pages, 2 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Wargames are a powerful tool for understanding and rehearsing real-world decision making. Automated play of wargames using artificial intelligence (AI) enables possibilities beyond those of human-conducted games, such as playing the game many times over to see a range of possible outcomes. There are two categories of wargames: quantitative games, with discrete types of moves, and qualitative games, which revolve around open-ended responses. Historically, automation efforts have focused on quantitative games, but large language models (LLMs) make it possible to automate qualitative wargames. We introduce "Snow Globe," an LLM-powered multi-agent system for playing qualitative wargames. With Snow Globe, every stage of a text-based qualitative wargame from scenario preparation to post-game analysis can be optionally carried out by AI, humans, or a combination thereof. We describe its software architecture conceptually and release an open-source implementation alongside this publication. As case studies, we simulate a tabletop exercise about an AI incident response and a political wargame about a geopolitical crisis. We discuss potential applications of the approach and how it fits into the broader wargaming ecosystem.
- [1570] arXiv:2404.11457 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Unifying Bias and Unfairness in Information Retrieval: A Survey of Challenges and Opportunities with Large Language ModelsSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: With the rapid advancement of large language models (LLMs), information retrieval (IR) systems, such as search engines and recommender systems, have undergone a significant paradigm shift. This evolution, while heralding new opportunities, introduces emerging challenges, particularly in terms of biases and unfairness, which may threaten the information ecosystem. In this paper, we present a comprehensive survey of existing works on emerging and pressing bias and unfairness issues in IR systems when the integration of LLMs. We first unify bias and unfairness issues as distribution mismatch problems, providing a groundwork for categorizing various mitigation strategies through distribution alignment. Subsequently, we systematically delve into the specific bias and unfairness issues arising from three critical stages of LLMs integration into IR systems: data collection, model development, and result evaluation. In doing so, we meticulously review and analyze recent literature, focusing on the definitions, characteristics, and corresponding mitigation strategies associated with these issues. Finally, we identify and highlight some open problems and challenges for future work, aiming to inspire researchers and stakeholders in the IR field and beyond to better understand and mitigate bias and unfairness issues of IR in this LLM era. We also consistently maintain a GitHub repository for the relevant papers and resources in this rising direction at this https URL .
- [1571] arXiv:2404.11461 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Using Game Engines and Machine Learning to Create Synthetic Satellite Imagery for a Tabletop Verification ExerciseJohannes Hoster , Sara Al-Sayed , Felix Biessmann , Alexander Glaser , Kristian Hildebrand , Igor Moric , Tuong Vy NguyenComments: Annual Meeting of the Institute of Nuclear Materials Management (INMM), ViennaSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: Satellite imagery is regarded as a great opportunity for citizen-based monitoring of activities of interest. Relevant imagery may however not be available at sufficiently high resolution, quality, or cadence -- let alone be uniformly accessible to open-source analysts. This limits an assessment of the true long-term potential of citizen-based monitoring of nuclear activities using publicly available satellite imagery. In this article, we demonstrate how modern game engines combined with advanced machine-learning techniques can be used to generate synthetic imagery of sites of interest with the ability to choose relevant parameters upon request; these include time of day, cloud cover, season, or level of activity onsite. At the same time, resolution and off-nadir angle can be adjusted to simulate different characteristics of the satellite. While there are several possible use-cases for synthetic imagery, here we focus on its usefulness to support tabletop exercises in which simple monitoring scenarios can be examined to better understand verification capabilities enabled by new satellite constellations and very short revisit times.
- [1572] arXiv:2404.11475 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AdaIR: Exploiting Underlying Similarities of Image Restoration Tasks with AdaptersSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Existing image restoration approaches typically employ extensive networks specifically trained for designated degradations. Despite being effective, such methods inevitably entail considerable storage costs and computational overheads due to the reliance on task-specific networks. In this work, we go beyond this well-established framework and exploit the inherent commonalities among image restoration tasks. The primary objective is to identify components that are shareable across restoration tasks and augment the shared components with modules specifically trained for individual tasks. Towards this goal, we propose AdaIR, a novel framework that enables low storage cost and efficient training without sacrificing performance. Specifically, a generic restoration network is first constructed through self-supervised pre-training using synthetic degradations. Subsequent to the pre-training phase, adapters are trained to adapt the pre-trained network to specific degradations. AdaIR requires solely the training of lightweight, task-specific modules, ensuring a more efficient storage and training regimen. We have conducted extensive experiments to validate the effectiveness of AdaIR and analyze the influence of the pre-training strategy on discovering shareable components. Extensive experimental results show that AdaIR achieves outstanding results on multi-task restoration while utilizing significantly fewer parameters (1.9 MB) and less training time (7 hours) for each restoration task. The source codes and trained models will be released.
- [1573] arXiv:2404.11477 (cross-list from nucl-th) [ pdf , ps , html , other ]
-
Title: Discovering Nuclear Models from Symbolic Machine LearningSubjects: Nuclear Theory (nucl-th) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Nuclear Experiment (nucl-ex)
Abstract: Numerous phenomenological nuclear models have been proposed to describe specific observables within different regions of the nuclear chart. However, developing a unified model that describes the complex behavior of all nuclei remains an open challenge. Here, we explore whether novel symbolic Machine Learning (ML) can rediscover traditional nuclear physics models or identify alternatives with improved simplicity, fidelity, and predictive power. To address this challenge, we developed a Multi-objective Iterated Symbolic Regression approach that handles symbolic regressions over multiple target observables, accounts for experimental uncertainties and is robust against high-dimensional problems. As a proof of principle, we applied this method to describe the nuclear binding energies and charge radii of light and medium mass nuclei. Our approach identified simple analytical relationships based on the number of protons and neutrons, providing interpretable models with precision comparable to state-of-the-art nuclear models. Additionally, we integrated this ML-discovered model with an existing complementary model to estimate the limits of nuclear stability. These results highlight the potential of symbolic ML to develop accurate nuclear models and guide our description of complex many-body problems.
- [1574] arXiv:2404.11488 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Multi-resolution Rescored ByteTrack for Video Object Detection on Ultra-low-power Embedded SystemsComments: 9 pages, 3 figures Accepted for publication at the Embedded Vision Workshop of the Computer Vision and Pattern Recognition conference, Seattle, 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces Multi-Resolution Rescored Byte-Track (MR2-ByteTrack), a novel video object detection framework for ultra-low-power embedded processors. This method reduces the average compute load of an off-the-shelf Deep Neural Network (DNN) based object detector by up to 2.25$\times$ by alternating the processing of high-resolution images (320$\times$320 pixels) with multiple down-sized frames (192$\times$192 pixels). To tackle the accuracy degradation due to the reduced image input size, MR2-ByteTrack correlates the output detections over time using the ByteTrack tracker and corrects potential misclassification using a novel probabilistic Rescore algorithm. By interleaving two down-sized images for every high-resolution one as the input of different state-of-the-art DNN object detectors with our MR2-ByteTrack, we demonstrate an average accuracy increase of 2.16% and a latency reduction of 43% on the GAP9 microcontroller compared to a baseline frame-by-frame inference scheme using exclusively full-resolution images. Code available at: this https URL
- [1575] arXiv:2404.11492 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: arcjetCV: an open-source software to analyze material ablationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: arcjetCV is an open-source Python software designed to automate time-resolved measurements of heatshield material recession and recession rates from arcjet test video footage. This new automated and accessible capability greatly exceeds previous manual extraction methods, enabling rapid and detailed characterization of material recession for any sample with a profile video. arcjetCV automates the video segmentation process using machine learning models, including a one-dimensional (1D) Convolutional Neural Network (CNN) to infer the time-window of interest, a two-dimensional (2D) CNN for image and edge segmentation, and a Local Outlier Factor (LOF) for outlier filtering. A graphical user interface (GUI) simplifies the user experience and an application programming interface (API) allows users to call the core functions from scripts, enabling video batch processing. arcjetCV's capability to measure time-resolved recession in turn enables characterization of non-linear processes (shrinkage, swelling, melt flows, etc.), contributing to higher fidelity validation and improved modeling of heatshield material performance. The source code associated with this article can be found at this https URL .
- [1576] arXiv:2404.11496 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Runtime Analysis of Evolutionary Diversity Optimization on the Multi-objective (LeadingOnes, TrailingZeros) ProblemSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: The diversity optimization is the class of optimization problems, in which we aim at finding a diverse set of good solutions. One of the frequently used approaches to solve such problems is to use evolutionary algorithms which evolve a desired diverse population. This approach is called evolutionary diversity optimization (EDO).
In this paper, we analyse EDO on a 3-objective function LOTZ$_k$, which is a modification of the 2-objective benchmark function (LeadingOnes, TrailingZeros). We prove that the GSEMO computes a set of all Pareto-optimal solutions in $O(kn^3)$ expected iterations. We also analyze the runtime of the GSEMO$_D$ (a modification of the GSEMO for diversity optimization) until it finds a population with the best possible diversity for two different diversity measures, the total imbalance and the sorted imbalances vector. For the first measure we show that the GSEMO$_D$ optimizes it asymptotically faster than it finds a Pareto-optimal population, in $O(kn^2\log(n))$ expected iterations, and for the second measure we show an upper bound of $O(k^2n^3\log(n))$ expected iterations. We complement our theoretical analysis with an empirical study, which shows a very similar behavior for both diversity measures that is close to the theory predictions. - [1577] arXiv:2404.11499 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Data-Driven Representation for Sign Language ProductionComments: 8 Pages, 3 Figures, 7 Tables, 18th IEEE International Conference on Automatic Face and Gesture Recognition 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Phonetic representations are used when recording spoken languages, but no equivalent exists for recording signed languages. As a result, linguists have proposed several annotation systems that operate on the gloss or sub-unit level; however, these resources are notably irregular and scarce.
Sign Language Production (SLP) aims to automatically translate spoken language sentences into continuous sequences of sign language. However, current state-of-the-art approaches rely on scarce linguistic resources to work. This has limited progress in the field. This paper introduces an innovative solution by transforming the continuous pose generation problem into a discrete sequence generation problem. Thus, overcoming the need for costly annotation. Although, if available, we leverage the additional information to enhance our approach.
By applying Vector Quantisation (VQ) to sign language data, we first learn a codebook of short motions that can be combined to create a natural sequence of sign. Where each token in the codebook can be thought of as the lexicon of our representation. Then using a transformer we perform a translation from spoken language text to a sequence of codebook tokens. Each token can be directly mapped to a sequence of poses allowing the translation to be performed by a single network. Furthermore, we present a sign stitching method to effectively join tokens together. We evaluate on the RWTH-PHOENIX-Weather-2014T (PHOENIX14T) and the more challenging Meine DGS Annotated (mDGS) datasets. An extensive evaluation shows our approach outperforms previous methods, increasing the BLEU-1 back translation score by up to 72%. - [1578] arXiv:2404.11500 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language ModelsComments: Accepted to the main conference of NAACL (2024)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper studies the relationship between the surface form of a mathematical problem and its solvability by large language models. We find that subtle alterations in the surface form can significantly impact the answer distribution and the solve rate, exposing the language model's lack of robustness and sensitivity to the surface form in reasoning through complex problems. To improve mathematical reasoning performance, we propose Self-Consistency-over-Paraphrases (SCoP), which diversifies reasoning paths from specific surface forms of the problem. We evaluate our approach on four mathematics reasoning benchmarks over three large language models and show that SCoP improves mathematical reasoning performance over vanilla self-consistency, particularly for problems initially deemed unsolvable. Finally, we provide additional experiments and discussion regarding problem difficulty and surface forms, including cross-model difficulty agreement and paraphrasing transferability, and Variance of Variations (VOV) for language model evaluation.
- [1579] arXiv:2404.11502 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language ModelsYushuo Chen , Tianyi Tang , Erge Xiang , Linjiang Li , Wayne Xin Zhao , Jing Wang , Yunpeng Chai , Ji-Rong WenSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In real world, large language models (LLMs) can serve as the assistant to help users accomplish their jobs, and also support the development of advanced applications. For the wide application of LLMs, the inference efficiency is an essential concern, which has been widely studied in existing work, and numerous optimization algorithms and code libraries have been proposed to improve it. Nonetheless, users still find it challenging to compare the effectiveness of all the above methods and understand the underlying mechanisms. In this work, we perform a detailed coarse-to-fine analysis of the inference performance of various code libraries. To evaluate the overall effectiveness, we examine four usage scenarios within two practical applications. We further provide both theoretical and empirical fine-grained analyses of each module in the Transformer architecture. Our experiments yield comprehensive results that are invaluable for researchers to evaluate code libraries and improve inference strategies.
- [1580] arXiv:2404.11534 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Decomposing and Editing Predictions by Modeling Model ComputationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: How does the internal computation of a machine learning model transform inputs into predictions? In this paper, we introduce a task called component modeling that aims to address this question. The goal of component modeling is to decompose an ML model's prediction in terms of its components -- simple functions (e.g., convolution filters, attention heads) that are the "building blocks" of model computation. We focus on a special case of this task, component attribution, where the goal is to estimate the counterfactual impact of individual components on a given prediction. We then present COAR, a scalable algorithm for estimating component attributions; we demonstrate its effectiveness across models, datasets, and modalities. Finally, we show that component attributions estimated with COAR directly enable model editing across five tasks, namely: fixing model errors, ``forgetting'' specific classes, boosting subpopulation robustness, localizing backdoor attacks, and improving robustness to typographic attacks. We provide code for COAR at this https URL .
- [1581] arXiv:2404.11536 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FedPFT: Federated Proxy Fine-Tuning of Foundation ModelsZhaopeng Peng , Xiaoliang Fan , Yufan Chen , Zheng Wang , Shirui Pan , Chenglu Wen , Ruisheng Zhang , Cheng WangComments: Accepted by IJCAI'24Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Adapting Foundation Models (FMs) for downstream tasks through Federated Learning (FL) emerges a promising strategy for protecting data privacy and valuable FMs. Existing methods fine-tune FM by allocating sub-FM to clients in FL, however, leading to suboptimal performance due to insufficient tuning and inevitable error accumulations of gradients. In this paper, we propose Federated Proxy Fine-Tuning (FedPFT), a novel method enhancing FMs adaptation in downstream tasks through FL by two key modules. First, the sub-FM construction module employs a layer-wise compression approach, facilitating comprehensive FM fine-tuning across all layers by emphasizing those crucial neurons. Second, the sub-FM alignment module conducts a two-step distillations-layer-level and neuron-level-before and during FL fine-tuning respectively, to reduce error of gradient by accurately aligning sub-FM with FM under theoretical guarantees. Experimental results on seven commonly used datasets (i.e., four text and three vision) demonstrate the superiority of FedPFT.
- [1582] arXiv:2404.11553 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Quantifying Multilingual Performance of Large Language Models Across LanguagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The training process of Large Language Models (LLMs) requires extensive text corpus. However, these data are often unevenly distributed in different languages. As a result, LLMs perform well on common languages, such as English, German, and French, but perform poorly on low-resource languages. However, currently there is no work to quantitatively measure the performance of LLMs in low-resource languages. To fill this gap, we proposed the Language Ranker that aims to benchmark and rank different languages according to the performance of LLMs on those languages. We employ the LLM's performance on the English corpus as a baseline to compare the performances of different languages and English. We have the following three findings: 1. The performance rankings of different LLMs in all languages are roughly the same. 2. LLMs with different sizes have the same partial order of performance. 3. There is a strong correlation between LlaMa2's performance in different languages and the proportion of the pre-training corpus. These findings illustrate that the Language Ranker can be used as an indicator to measure the language performance of LLMs.
- [1583] arXiv:2404.11565 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image GenerationComments: Project Website: this https URL , Same as previous version, only updated metadata because bib was missing an author nameSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR)
Abstract: We introduce a new architecture for personalization of text-to-image diffusion models, coined Mixture-of-Attention (MoA). Inspired by the Mixture-of-Experts mechanism utilized in large language models (LLMs), MoA distributes the generation workload between two attention pathways: a personalized branch and a non-personalized prior branch. MoA is designed to retain the original model's prior by fixing its attention layers in the prior branch, while minimally intervening in the generation process with the personalized branch that learns to embed subjects in the layout and context generated by the prior branch. A novel routing mechanism manages the distribution of pixels in each layer across these branches to optimize the blend of personalized and generic content creation. Once trained, MoA facilitates the creation of high-quality, personalized images featuring multiple subjects with compositions and interactions as diverse as those generated by the original model. Crucially, MoA enhances the distinction between the model's pre-existing capability and the newly augmented personalized intervention, thereby offering a more disentangled subject-context control that was previously unattainable. Project page: this https URL
- [1584] arXiv:2404.11577 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Reliable Empirical Machine Unlearning Evaluation: A Game-Theoretic ViewSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Machine unlearning is the process of updating machine learning models to remove the information of specific training data samples, in order to comply with data protection regulations that allow individuals to request the removal of their personal data. Despite the recent development of numerous unlearning algorithms, reliable evaluation of these algorithms remains an open research question. In this work, we focus on membership inference attack (MIA) based evaluation, one of the most common approaches for evaluating unlearning algorithms, and address various pitfalls of existing evaluation metrics that lack reliability. Specifically, we propose a game-theoretic framework that formalizes the evaluation process as a game between unlearning algorithms and MIA adversaries, measuring the data removal efficacy of unlearning algorithms by the capability of the MIA adversaries. Through careful design of the game, we demonstrate that the natural evaluation metric induced from the game enjoys provable guarantees that the existing evaluation metrics fail to satisfy. Furthermore, we propose a practical and efficient algorithm to estimate the evaluation metric induced from the game, and demonstrate its effectiveness through both theoretical analysis and empirical experiments. This work presents a novel and reliable approach to empirically evaluating unlearning algorithms, paving the way for the development of more effective unlearning techniques.
- [1585] arXiv:2404.11578 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Deep Policy Optimization with Temporal Logic ConstraintsComments: preprint, 8 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Abstract: Temporal logics, such as linear temporal logic (LTL), offer a precise means of specifying tasks for (deep) reinforcement learning (RL) agents. In our work, we consider the setting where the task is specified by an LTL objective and there is an additional scalar reward that we need to optimize. Previous works focus either on learning a LTL task-satisfying policy alone or are restricted to finite state spaces. We make two contributions: First, we introduce an RL-friendly approach to this setting by formulating this problem as a single optimization objective. Our formulation guarantees that an optimal policy will be reward-maximal from the set of policies that maximize the likelihood of satisfying the LTL specification. Second, we address a sparsity issue that often arises for LTL-guided Deep RL policies by introducing Cycle Experience Replay (CyclER), a technique that automatically guides RL agents towards the satisfaction of an LTL specification. Our experiments demonstrate the efficacy of CyclER in finding performant deep RL policies in both continuous and discrete experimental domains.
- [1586] arXiv:2404.11589 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Prompt Optimizer of Text-to-Image Diffusion Models for Abstract Concept UnderstandingZezhong Fan , Xiaohan Li , Chenhao Fang , Topojoy Biswas , Kaushiki Nag , Jianpeng Xu , Kannan AchanComments: WWW 2024 CompanionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The rapid evolution of text-to-image diffusion models has opened the door of generative AI, enabling the translation of textual descriptions into visually compelling images with remarkable quality. However, a persistent challenge within this domain is the optimization of prompts to effectively convey abstract concepts into concrete objects. For example, text encoders can hardly express "peace", while can easily illustrate olive branches and white doves. This paper introduces a novel approach named Prompt Optimizer for Abstract Concepts (POAC) specifically designed to enhance the performance of text-to-image diffusion models in interpreting and generating images from abstract concepts. We propose a Prompt Language Model (PLM), which is initialized from a pre-trained language model, and then fine-tuned with a curated dataset of abstract concept prompts. The dataset is created with GPT-4 to extend the abstract concept to a scene and concrete objects. Our framework employs a Reinforcement Learning (RL)-based optimization strategy, focusing on the alignment between the generated images by a stable diffusion model and optimized prompts. Through extensive experiments, we demonstrate that our proposed POAC significantly improves the accuracy and aesthetic quality of generated images, particularly in the description of abstract concepts and alignment with optimized prompts. We also present a comprehensive analysis of our model's performance across diffusion models under different settings, showcasing its versatility and effectiveness in enhancing abstract concept representation.
- [1587] arXiv:2404.11605 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: VG4D: Vision-Language Model Goes 4D Video RecognitionComments: ICRA 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Understanding the real world through point cloud video is a crucial aspect of robotics and autonomous driving systems. However, prevailing methods for 4D point cloud recognition have limitations due to sensor resolution, which leads to a lack of detailed information. Recent advances have shown that Vision-Language Models (VLM) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts that can be transferred to various downstream tasks. However, effectively integrating VLM into the domain of 4D point clouds remains an unresolved problem. In this work, we propose the Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network. Our approach involves aligning the 4D encoder's representation with a VLM to learn a shared visual and text space from training on large-scale image-text pairs. By transferring the knowledge of the VLM to the 4D encoder and combining the VLM, our VG4D achieves improved recognition performance. To enhance the 4D encoder, we modernize the classic dynamic point cloud backbone and propose an improved version of PSTNet, im-PSTNet, which can efficiently model point cloud videos. Experiments demonstrate that our method achieves state-of-the-art performance for action recognition on both the NTU RGB+D 60 dataset and the NTU RGB+D 120 dataset. Code is available at \url{ this https URL }.
- [1588] arXiv:2404.11606 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Learning to Solve the Constrained Most Probable Explanation Task in Probabilistic Graphical ModelsComments: Will appear in AISTATS 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We propose a self-supervised learning approach for solving the following constrained optimization task in log-linear models or Markov networks. Let $f$ and $g$ be two log-linear models defined over the sets $\mathbf{X}$ and $\mathbf{Y}$ of random variables respectively. Given an assignment $\mathbf{x}$ to all variables in $\mathbf{X}$ (evidence) and a real number $q$, the constrained most-probable explanation (CMPE) task seeks to find an assignment $\mathbf{y}$ to all variables in $\mathbf{Y}$ such that $f(\mathbf{x}, \mathbf{y})$ is maximized and $g(\mathbf{x}, \mathbf{y})\leq q$. In our proposed self-supervised approach, given assignments $\mathbf{x}$ to $\mathbf{X}$ (data), we train a deep neural network that learns to output near-optimal solutions to the CMPE problem without requiring access to any pre-computed solutions. The key idea in our approach is to use first principles and approximate inference methods for CMPE to derive novel loss functions that seek to push infeasible solutions towards feasible ones and feasible solutions towards optimal ones. We analyze the properties of our proposed method and experimentally demonstrate its efficacy on several benchmark problems.
- [1589] arXiv:2404.11630 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SNP: Structured Neuron-level Pruning to Preserve Attention ScoresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Multi-head self-attention (MSA) is a key component of Vision Transformers (ViTs), which have achieved great success in various vision tasks. However, their high computational cost and memory footprint hinder their deployment on resource-constrained devices. Conventional pruning approaches can only compress and accelerate the MSA module using head pruning, although the head is not an atomic unit. To address this issue, we propose a novel graph-aware neuron-level pruning method, Structured Neuron-level Pruning (SNP). SNP prunes neurons with less informative attention scores and eliminates redundancy among heads. Specifically, it prunes graphically connected query and key layers having the least informative attention scores while preserving the overall attention scores. Value layers, which can be pruned independently, are pruned to eliminate inter-head redundancy. Our proposed method effectively compresses and accelerates Transformer-based models for both edge devices and server processors. For instance, the DeiT-Small with SNP runs 3.1$\times$ faster than the original model and achieves performance that is 21.94\% faster and 1.12\% higher than the DeiT-Tiny. Additionally, SNP combine successfully with conventional head or block pruning approaches. SNP with head pruning could compress the DeiT-Base by 80\% of the parameters and computational costs and achieve 3.85$\times$ faster inference speed on RTX3090 and 4.93$\times$ on Jetson Nano.
- [1590] arXiv:2404.11667 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Deep Dependency Networks and Advanced Inference Schemes for Multi-Label ClassificationComments: Will appear in AISTATS 2024. arXiv admin note: substantial text overlap with arXiv:2302.00633Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Abstract: We present a unified framework called deep dependency networks (DDNs) that combines dependency networks and deep learning architectures for multi-label classification, with a particular emphasis on image and video data. The primary advantage of dependency networks is their ease of training, in contrast to other probabilistic graphical models like Markov networks. In particular, when combined with deep learning architectures, they provide an intuitive, easy-to-use loss function for multi-label classification. A drawback of DDNs compared to Markov networks is their lack of advanced inference schemes, necessitating the use of Gibbs sampling. To address this challenge, we propose novel inference schemes based on local search and integer linear programming for computing the most likely assignment to the labels given observations. We evaluate our novel methods on three video datasets (Charades, TACoS, Wetlab) and three image datasets (MS-COCO, PASCAL VOC, NUS-WIDE), comparing their performance with (a) basic neural architectures and (b) neural architectures combined with Markov networks equipped with advanced inference and learning techniques. Our results demonstrate the superiority of our new DDN methods over the two competing approaches.
- [1591] arXiv:2404.11730 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Missed Connections: Lateral Thinking Puzzles for Large Language ModelsComments: 8 pages, 3 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme. Solving the puzzle requires both common linguistic knowledge (i.e. definitions and typical usage) as well as, in many cases, lateral or abstract thinking. This is because the four categories ascend in complexity, with the most challenging category often requiring thinking about words in uncommon ways or as parts of larger phrases. We investigate the capacity for automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning and a way to measure the semantic information encoded by data-driven linguistic systems. In particular, we study both a sentence-embedding baseline and modern large language models (LLMs). We report their accuracy on the task, measure the impacts of chain-of-thought prompting, and discuss their failure modes. Overall, we find that the Connections task is challenging yet feasible, and a strong test-bed for future work.
- [1592] arXiv:2404.11754 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Improved Generalization Bounds for Communication Efficient Federated LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This paper focuses on reducing the communication cost of federated learning by exploring generalization bounds and representation learning. We first characterize a tighter generalization bound for one-round federated learning based on local clients' generalizations and heterogeneity of data distribution (non-iid scenario). We also characterize a generalization bound in R-round federated learning and its relation to the number of local updates (local stochastic gradient descents (SGDs)). Then, based on our generalization bound analysis and our representation learning interpretation of this analysis, we show for the first time that less frequent aggregations, hence more local updates, for the representation extractor (usually corresponds to initial layers) leads to the creation of more generalizable models, particularly for non-iid scenarios. We design a novel Federated Learning with Adaptive Local Steps (FedALS) algorithm based on our generalization bound and representation learning analysis. FedALS employs varying aggregation frequencies for different parts of the model, so reduces the communication cost. The paper is followed with experimental results showing the effectiveness of FedALS.
- [1593] arXiv:2404.11770 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Event-Based Eye Tracking. AIS 2024 Challenge SurveyZuowen Wang , Chang Gao , Zongwei Wu , Marcos V. Conde , Radu Timofte , Shih-Chii Liu , Qinyu Chen , Zheng-jun Zha , Wei Zhai , Han Han , Bohao Liao , Yuliang Wu , Zengyu Wan , Zhong Wang , Yang Cao , Ganchao Tan , Jinze Chen , Yan Ru Pei , Sasskia Brüers , Sébastien Crouzet , Douglas McLelland , Oliver Coenen , Baoheng Zhang , Yizhao Gao , Jingyuan Li , Hayden Kwok-Hay So , Philippe Bich , Chiara Boretti , Luciano Prono , Mircea Lică , David Dinucu-Jianu , Cătălin Grîu , Xiaopeng Lin , Hongwei Ren , Bojun Cheng , Xinan Zhang , Valentin Vial , Anthony Yezzi , James TsaiComments: Qinyu Chen is the corresponding authorSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This survey reviews the AIS 2024 Event-Based Eye Tracking (EET) Challenge. The task of the challenge focuses on processing eye movement recorded with event cameras and predicting the pupil center of the eye. The challenge emphasizes efficient eye tracking with event cameras to achieve good task accuracy and efficiency trade-off. During the challenge period, 38 participants registered for the Kaggle competition, and 8 teams submitted a challenge factsheet. The novel and diverse methods from the submitted factsheets are reviewed and analyzed in this survey to advance future event-based eye tracking research.
- [1594] arXiv:2404.11773 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Behavior Alignment: A New Perspective of Evaluating LLM-based Conversational Recommendation SystemsComments: Accepted by the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024)Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have demonstrated great potential in Conversational Recommender Systems (CRS). However, the application of LLMs to CRS has exposed a notable discrepancy in behavior between LLM-based CRS and human recommenders: LLMs often appear inflexible and passive, frequently rushing to complete the recommendation task without sufficient inquiry.This behavior discrepancy can lead to decreased accuracy in recommendations and lower user satisfaction. Despite its importance, existing studies in CRS lack a study about how to measure such behavior discrepancy. To fill this gap, we propose Behavior Alignment, a new evaluation metric to measure how well the recommendation strategies made by a LLM-based CRS are consistent with human recommenders'. Our experiment results show that the new metric is better aligned with human preferences and can better differentiate how systems perform than existing evaluation metrics. As Behavior Alignment requires explicit and costly human annotations on the recommendation strategies, we also propose a classification-based method to implicitly measure the Behavior Alignment based on the responses. The evaluation results confirm the robustness of the method.
- [1595] arXiv:2404.11782 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: REQUAL-LM: Reliability and Equity through Aggregation in Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: The extensive scope of large language models (LLMs) across various domains underscores the critical importance of responsibility in their application, beyond natural language processing. In particular, the randomized nature of LLMs, coupled with inherent biases and historical stereotypes in data, raises critical concerns regarding reliability and equity. Addressing these challenges are necessary before using LLMs for applications with societal impact. Towards addressing this gap, we introduce REQUAL-LM, a novel method for finding reliable and equitable LLM outputs through aggregation. Specifically, we develop a Monte Carlo method based on repeated sampling to find a reliable output close to the mean of the underlying distribution of possible outputs. We formally define the terms such as reliability and bias, and design an equity-aware aggregation to minimize harmful bias while finding a highly reliable output. REQUAL-LM does not require specialized hardware, does not impose a significant computing load, and uses LLMs as a blackbox. This design choice enables seamless scalability alongside the rapid advancement of LLM technologies. Our system does not require retraining the LLMs, which makes it deployment ready and easy to adapt. Our comprehensive experiments using various tasks and datasets demonstrate that REQUAL- LM effectively mitigates bias and selects a more equitable response, specifically the outputs that properly represents minority groups.
- [1596] arXiv:2404.11795 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Prompt-Driven Feature Diffusion for Open-World Semi-Supervised LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: In this paper, we present a novel approach termed Prompt-Driven Feature Diffusion (PDFD) within a semi-supervised learning framework for Open World Semi-Supervised Learning (OW-SSL). At its core, PDFD deploys an efficient feature-level diffusion model with the guidance of class-specific prompts to support discriminative feature representation learning and feature generation, tackling the challenge of the non-availability of labeled data for unseen classes in OW-SSL. In particular, PDFD utilizes class prototypes as prompts in the diffusion model, leveraging their class-discriminative and semantic generalization ability to condition and guide the diffusion process across all the seen and unseen classes. Furthermore, PDFD incorporates a class-conditional adversarial loss for diffusion model training, ensuring that the features generated via the diffusion process can be discriminatively aligned with the class-conditional features of the real data. Additionally, the class prototypes of the unseen classes are computed using only unlabeled instances with confident predictions within a semi-supervised learning framework. We conduct extensive experiments to evaluate the proposed PDFD. The empirical results show PDFD exhibits remarkable performance enhancements over many state-of-the-art existing methods.
- [1597] arXiv:2404.11797 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral ImageryYiqun Xie , Zhihao Wang , Weiye Chen , Zhili Li , Xiaowei Jia , Yanhua Li , Ruichen Wang , Kangyang Chai , Ruohan Li , Sergii SkakunSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Foundation models, i.e., very large deep learning models, have demonstrated impressive performances in various language and vision tasks that are otherwise difficult to reach using smaller-size models. The major success of GPT-type of language models is particularly exciting and raises expectations on the potential of foundation models in other domains including satellite remote sensing. In this context, great efforts have been made to build foundation models to test their capabilities in broader applications, and examples include Prithvi by NASA-IBM, Segment-Anything-Model, ViT, etc. This leads to an important question: Are foundation models always a suitable choice for different remote sensing tasks, and when or when not? This work aims to enhance the understanding of the status and suitability of foundation models for pixel-level classification using multispectral imagery at moderate resolution, through comparisons with traditional machine learning (ML) and regular-size deep learning models. Interestingly, the results reveal that in many scenarios traditional ML models still have similar or better performance compared to foundation models, especially for tasks where texture is less useful for classification. On the other hand, deep learning models did show more promising results for tasks where labels partially depend on texture (e.g., burn scar), while the difference in performance between foundation models and deep learning models is not obvious. The results conform with our analysis: The suitability of foundation models depend on the alignment between the self-supervised learning tasks and the real downstream tasks, and the typical masked autoencoder paradigm is not necessarily suitable for many remote sensing problems.
- [1598] arXiv:2404.11800 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Developing Situational Awareness for Joint Action with Autonomous VehiclesComments: 16 pagesSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Unanswered questions about how human-AV interaction designers can support rider's informational needs hinders Autonomous Vehicles (AV) adoption. To achieve joint human-AV action goals - such as safe transportation, trust, or learning from an AV - sufficient situational awareness must be held by the human, AV, and human-AV system collectively. We present a systems-level framework that integrates cognitive theories of joint action and situational awareness as a means to tailor communications that meet the criteria necessary for goal success. This framework is based on four components of the shared situation: AV traits, action goals, subject-specific traits and states, and the situated driving context. AV communications should be tailored to these factors and be sensitive when they change. This framework can be useful for understanding individual, shared, and distributed human-AV situational awareness and designing for future AV communications that meet the informational needs and goals of diverse groups and in diverse driving contexts.
- [1599] arXiv:2404.11803 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: TempBEV: Improving Learned BEV Encoders with Combined Image and BEV Space Temporal AggregationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: Autonomous driving requires an accurate representation of the environment. A strategy toward high accuracy is to fuse data from several sensors. Learned Bird's-Eye View (BEV) encoders can achieve this by mapping data from individual sensors into one joint latent space. For cost-efficient camera-only systems, this provides an effective mechanism to fuse data from multiple cameras with different views. Accuracy can further be improved by aggregating sensor information over time. This is especially important in monocular camera systems to account for the lack of explicit depth and velocity measurements. Thereby, the effectiveness of developed BEV encoders crucially depends on the operators used to aggregate temporal information and on the used latent representation spaces. We analyze BEV encoders proposed in the literature and compare their effectiveness, quantifying the effects of aggregation operators and latent representations. While most existing approaches aggregate temporal information either in image or in BEV latent space, our analyses and performance comparisons suggest that these latent representations exhibit complementary strengths. Therefore, we develop a novel temporal BEV encoder, TempBEV, which integrates aggregated temporal information from both latent spaces. We consider subsequent image frames as stereo through time and leverage methods from optical flow estimation for temporal stereo encoding. Empirical evaluation on the NuScenes dataset shows a significant improvement by TempBEV over the baseline for 3D object detection and BEV segmentation. The ablation uncovers a strong synergy of joint temporal aggregation in the image and BEV latent space. These results indicate the overall effectiveness of our approach and make a strong case for aggregating temporal information in both image and BEV latent spaces.
- [1600] arXiv:2404.11811 (cross-list from physics.chem-ph) [ pdf , ps , other ]
-
Title: Physics-informed active learning for accelerating quantum chemical simulationsSubjects: Chemical Physics (physics.chem-ph) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Quantum chemical simulations can be greatly accelerated by constructing machine learning potentials, which is often done using active learning (AL). The usefulness of the constructed potentials is often limited by the high effort required and their insufficient robustness in the simulations. Here we introduce the end-to-end AL for constructing robust data-efficient potentials with affordable investment of time and resources and minimum human interference. Our AL protocol is based on the physics-informed sampling of training points, automatic selection of initial data, and uncertainty quantification. The versatility of this protocol is shown in our implementation of quasi-classical molecular dynamics for simulating vibrational spectra, conformer search of a key biochemical molecule, and time-resolved mechanism of the Diels-Alder reaction. These investigations took us days instead of weeks of pure quantum chemical calculations on a high-performance computing cluster.
- [1601] arXiv:2404.11812 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Cross-model Mutual Learning for Exemplar-based Medical Image SegmentationComments: AISTATS 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Medical image segmentation typically demands extensive dense annotations for model training, which is both time-consuming and skill-intensive. To mitigate this burden, exemplar-based medical image segmentation methods have been introduced to achieve effective training with only one annotated image. In this paper, we introduce a novel Cross-model Mutual learning framework for Exemplar-based Medical image Segmentation (CMEMS), which leverages two models to mutually excavate implicit information from unlabeled data at multiple granularities. CMEMS can eliminate confirmation bias and enable collaborative training to learn complementary information by enforcing consistency at different granularities across models. Concretely, cross-model image perturbation based mutual learning is devised by using weakly perturbed images to generate high-confidence pseudo-labels, supervising predictions of strongly perturbed images across models. This approach enables joint pursuit of prediction consistency at the image granularity. Moreover, cross-model multi-level feature perturbation based mutual learning is designed by letting pseudo-labels supervise predictions from perturbed multi-level features with different resolutions, which can broaden the perturbation space and enhance the robustness of our framework. CMEMS is jointly trained using exemplar data, synthetic data, and unlabeled data in an end-to-end manner. Experimental results on two medical image datasets indicate that the proposed CMEMS outperforms the state-of-the-art segmentation methods with extremely limited supervision.
- [1602] arXiv:2404.11874 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Using a Local Surrogate Model to Interpret Temporal Shifts in Global Annual DataComments: There are 9 pages and 5 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This paper focuses on explaining changes over time in globally-sourced, annual temporal data, with the specific objective of identifying pivotal factors that contribute to these temporal shifts. Leveraging such analytical frameworks can yield transformative impacts, including the informed refinement of public policy and the identification of key drivers affecting a country's economic evolution. We employ Local Interpretable Model-agnostic Explanations (LIME) to shed light on national happiness indices, economic freedom, and population metrics, spanning variable time frames. Acknowledging the presence of missing values, we employ three imputation approaches to generate robust multivariate time-series datasets apt for LIME's input requirements. Our methodology's efficacy is substantiated through a series of empirical evaluations involving multiple datasets. These evaluations include comparative analyses against random feature selection, correlation with real-world events as elucidated by LIME, and validation through Individual Conditional Expectation (ICE) plots, a state-of-the-art technique proficient in feature importance detection.
- [1603] arXiv:2404.11888 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: The Dog Walking Theory: Rethinking Convergence in Federated LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Federated learning (FL) is a collaborative learning paradigm that allows different clients to train one powerful global model without sharing their private data. Although FL has demonstrated promising results in various applications, it is known to suffer from convergence issues caused by the data distribution shift across different clients, especially on non-independent and identically distributed (non-IID) data. In this paper, we study the convergence of FL on non-IID data and propose a novel \emph{Dog Walking Theory} to formulate and identify the missing element in existing research. The Dog Walking Theory describes the process of a dog walker leash walking multiple dogs from one side of the park to the other. The goal of the dog walker is to arrive at the right destination while giving the dogs enough exercise (i.e., space exploration). In FL, the server is analogous to the dog walker while the clients are analogous to the dogs. This analogy allows us to identify one crucial yet missing element in existing FL algorithms: the leash that guides the exploration of the clients. To address this gap, we propose a novel FL algorithm \emph{FedWalk} that leverages an external easy-to-converge task at the server side as a \emph{leash task} to guide the local training of the clients. We theoretically analyze the convergence of FedWalk with respect to data heterogeneity (between server and clients) and task discrepancy (between the leash and the original tasks). Experiments on multiple benchmark datasets demonstrate the superiority of FedWalk over state-of-the-art FL methods under both IID and non-IID settings.
- [1604] arXiv:2404.11916 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SKIP: Skill-Localized Prompt Tuning for Inference Speed Boost-UpComments: 6 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Prompt-tuning methods have shown comparable performance as parameter-efficient fine-tuning (PEFT) methods in various natural language understanding tasks. However, existing prompt tuning methods still utilize the entire model architecture; thus, they fail to accelerate inference speed in the application. In this paper, we propose a novel approach called SKIll-localized Prompt tuning (SKIP), which is extremely efficient in inference time. Our method significantly enhances inference efficiency by investigating and utilizing a skill-localized subnetwork in a language model. Surprisingly, our method improves the inference speed up to 160% while pruning 52% of the parameters. Furthermore, we demonstrate that our method is applicable across various transformer-based architectures, thereby confirming its practicality and scalability.
- [1605] arXiv:2404.11917 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Expected Coordinate Improvement for High-Dimensional Bayesian OptimizationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Bayesian optimization (BO) algorithm is very popular for solving low-dimensional expensive optimization problems. Extending Bayesian optimization to high dimension is a meaningful but challenging task. One of the major challenges is that it is difficult to find good infill solutions as the acquisition functions are also high-dimensional. In this work, we propose the expected coordinate improvement (ECI) criterion for high-dimensional Bayesian optimization. The proposed ECI criterion measures the potential improvement we can get by moving the current best solution along one coordinate. The proposed approach selects the coordinate with the highest ECI value to refine in each iteration and covers all the coordinates gradually by iterating over the coordinates. The greatest advantage of the proposed ECI-BO (expected coordinate improvement based Bayesian optimization) algorithm over the standard BO algorithm is that the infill selection problem of the proposed algorithm is always a one-dimensional problem thus can be easily solved. Numerical experiments show that the proposed algorithm can achieve significantly better results than the standard BO algorithm and competitive results when compared with five state-of-the-art high-dimensional BOs. This work provides a simple but efficient approach for high-dimensional Bayesian optimization.
- [1606] arXiv:2404.11925 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: EdgeFusion: On-Device Text-to-Image GenerationThibault Castells , Hyoung-Kyu Song , Tairen Piao , Shinkook Choi , Bo-Kyeong Kim , Hanyoung Yim , Changgwun Lee , Jae Gon Kim , Tae-Ho KimComments: 4 pages, accepted to CVPR24 First Workshop on Efficient and On-Device Generation (EDGE)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle for its practical application. To tackle this challenge, recent research focuses on methods to reduce sampling steps, such as Latent Consistency Model (LCM), and on employing architectural optimizations, including pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. It leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through our thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.
- [1607] arXiv:2404.11929 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: A Symmetric Regressor for MRI-Based Assessment of Striatal Dopamine Transporter Uptake in Parkinson's DiseaseSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Dopamine transporter (DAT) imaging is commonly used for monitoring Parkinson's disease (PD), where striatal DAT uptake amount is computed to assess PD severity. However, DAT imaging has a high cost and the risk of radiance exposure and is not available in general clinics. Recently, MRI patch of the nigral region has been proposed as a safer and easier alternative. This paper proposes a symmetric regressor for predicting the DAT uptake amount from the nigral MRI patch. Acknowledging the symmetry between the right and left nigrae, the proposed regressor incorporates a paired input-output model that simultaneously predicts the DAT uptake amounts for both the right and left striata. Moreover, it employs a symmetric loss that imposes a constraint on the difference between right-to-left predictions, resembling the high correlation in DAT uptake amounts in the two lateral sides. Additionally, we propose a symmetric Monte-Carlo (MC) dropout method for providing a fruitful uncertainty estimate of the DAT uptake prediction, which utilizes the above symmetry. We evaluated the proposed approach on 734 nigral patches, which demonstrated significantly improved performance of the symmetric regressor compared with the standard regressors while giving better explainability and feature representation. The symmetric MC dropout also gave precise uncertainty ranges with a high probability of including the true DAT uptake amounts within the range.
- [1608] arXiv:2404.11932 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge AlignmentComments: 11 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Multilingual proficiency presents a significant challenge for large language models (LLMs). English-centric models are usually suboptimal in other languages, particularly those that are linguistically distant from English. This performance discrepancy mainly stems from the imbalanced distribution of training data across languages during pre-training and instruction tuning stages. To address this problem, we propose a novel approach called CrossIn, which utilizes a mixed composition of cross-lingual instruction tuning data. Our method leverages the compressed representation shared by various languages to efficiently enhance the model's task-solving capabilities and multilingual proficiency within a single process. In addition, we introduce a multi-task and multi-faceted benchmark to evaluate the effectiveness of CrossIn. Experimental results demonstrate that our method substantially improves performance across tasks and languages, and we provide extensive insights into the impact of cross-lingual data volume and the integration of translation data on enhancing multilingual consistency and accuracy.
- [1609] arXiv:2404.11936 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic InsightsComments: 8 pages, accepted to CVPR24 First Workshop on Efficient and On-Device Generation (EDGE)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Latent Diffusion Models (LDMs) have emerged as powerful generative models, known for delivering remarkable results under constrained computational resources. However, deploying LDMs on resource-limited devices remains a complex issue, presenting challenges such as memory consumption and inference speed. To address this issue, we introduce LD-Pruner, a novel performance-preserving structured pruning method for compressing LDMs. Traditional pruning methods for deep neural networks are not tailored to the unique characteristics of LDMs, such as the high computational cost of training and the absence of a fast, straightforward and task-agnostic method for evaluating model performance. Our method tackles these challenges by leveraging the latent space during the pruning process, enabling us to effectively quantify the impact of pruning on model performance, independently of the task at hand. This targeted pruning of components with minimal impact on the output allows for faster convergence during training, as the model has less information to re-learn, thereby addressing the high computational cost of training. Consequently, our approach achieves a compressed model that offers improved inference speed and reduced parameter count, while maintaining minimal performance degradation. We demonstrate the effectiveness of our approach on three different tasks: text-to-image (T2I) generation, Unconditional Image Generation (UIG) and Unconditional Audio Generation (UAG). Notably, we reduce the inference time of Stable Diffusion (SD) by 34.9% while simultaneously improving its FID by 5.2% on MS-COCO T2I benchmark. This work paves the way for more efficient pruning methods for LDMs, enhancing their applicability.
- [1610] arXiv:2404.11949 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Sketch-guided Image Inpainting with Partial Discrete Diffusion ProcessComments: Accepted to NTIRE Workshop @ CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In this work, we study the task of sketch-guided image inpainting. Unlike the well-explored natural language-guided image inpainting, which excels in capturing semantic details, the relatively less-studied sketch-guided inpainting offers greater user control in specifying the object's shape and pose to be inpainted. As one of the early solutions to this task, we introduce a novel partial discrete diffusion process (PDDP). The forward pass of the PDDP corrupts the masked regions of the image and the backward pass reconstructs these masked regions conditioned on hand-drawn sketches using our proposed sketch-guided bi-directional transformer. The proposed novel transformer module accepts two inputs -- the image containing the masked region to be inpainted and the query sketch to model the reverse diffusion process. This strategy effectively addresses the domain gap between sketches and natural images, thereby, enhancing the quality of inpainting results. In the absence of a large-scale dataset specific to this task, we synthesize a dataset from the MS-COCO to train and extensively evaluate our proposed framework against various competent approaches in the literature. The qualitative and quantitative results and user studies establish that the proposed method inpaints realistic objects that fit the context in terms of the visual appearance of the provided sketch. To aid further research, we have made our code publicly available at this https URL .
- [1611] arXiv:2404.11960 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM RankersSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: The most recent pointwise Large Language Model (LLM) rankers have achieved remarkable ranking results. However, these rankers are hindered by two major drawbacks: (1) they fail to follow a standardized comparison guidance during the ranking process, and (2) they struggle with comprehensive considerations when dealing with complicated passages. To address these shortcomings, we propose to build a ranker that generates ranking scores based on a set of criteria from various perspectives. These criteria are intended to direct each perspective in providing a distinct yet synergistic evaluation. Our research, which examines eight datasets from the BEIR benchmark demonstrates that incorporating this multi-perspective criteria ensemble approach markedly enhanced the performance of pointwise LLM rankers.
- [1612] arXiv:2404.11999 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Token-level Direct Preference OptimizationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs in a token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at this https URL .
- [1613] arXiv:2404.12008 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: How Do Recommendation Models Amplify Popularity Bias? An Analysis from the Spectral PerspectiveComments: 23 pages, 7 figuresSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Recommendation Systems (RS) are often plagued by popularity bias. Specifically,when recommendation models are trained on long-tailed datasets, they not only inherit this bias but often exacerbate it. This effect undermines both the precision and fairness of RS and catalyzes the so-called Matthew Effect. Despite the widely recognition of this issue, the fundamental causes remain largely elusive. In our research, we delve deeply into popularity bias amplification. Our comprehensive theoretical and empirical investigations lead to two core insights: 1) Item popularity is memorized in the principal singular vector of the score matrix predicted by the recommendation model; 2) The dimension collapse phenomenon amplifies the impact of principal singular vector on model predictions, intensifying the popularity bias. Based on these insights, we propose a novel method to mitigate this bias by imposing penalties on the magnitude of the principal singular value. Considering the heavy computational burden in directly evaluating the gradient of the principal singular value, we develop an efficient algorithm that harnesses the inherent properties of the singular vector. Extensive experiments across seven real-world datasets and three testing scenarios have been conducted to validate the superiority of our method.
- [1614] arXiv:2404.12010 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic DiversitySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Paraphrase generation is a pivotal task in natural language processing (NLP). Existing datasets in the domain lack syntactic and lexical diversity, resulting in paraphrases that closely resemble the source sentences. Moreover, these datasets often contain hate speech and noise, and may unintentionally include non-English language sentences. This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using Large Language Models (LLM) to address these challenges. ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity. It also mitigates the presence of hate speech and reduces noise, ensuring a cleaner and more focused English dataset. Results show that ParaFusion offers at least a 25% improvement in both syntactic and lexical diversity, measured across several metrics for each data source. The paper also aims to set a gold standard for paraphrase evaluation as it contains one of the most comprehensive evaluation strategies to date. The results underscore the potential of ParaFusion as a valuable resource for improving NLP applications.
- [1615] arXiv:2404.12023 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: Context-Aware Orchestration of Energy-Efficient Gossip Learning SchemesComments: IEEE AIIOT 2024Subjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Fully distributed learning schemes such as Gossip Learning (GL) are gaining momentum due to their scalability and effectiveness even in dynamic settings. However, they often imply a high utilization of communication and computing resources, whose energy footprint may jeopardize the learning process, particularly on battery-operated IoT devices. To address this issue, we present Optimized Gossip Learning (OGL)}, a distributed training approach based on the combination of GL with adaptive optimization of the learning process, which allows for achieving a target accuracy while minimizing the energy consumption of the learning process. We propose a data-driven approach to OGL management that relies on optimizing in real-time for each node the number of training epochs and the choice of which model to exchange with neighbors based on patterns of node contacts, models' quality, and available resources at each node. Our approach employs a DNN model for dynamic tuning of the aforementioned parameters, trained by an infrastructure-based orchestrator function. We performed our assessments on two different datasets, leveraging time-varying random graphs and a measurement-based dynamic urban scenario. Results suggest that our approach is highly efficient and effective in a broad spectrum of network scenarios.
- [1616] arXiv:2404.12030 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Mapping back and forth between model predictive control and neural networksComments: 13 pagesSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI)
Abstract: Model predictive control (MPC) for linear systems with quadratic costs and linear constraints is shown to admit an exact representation as an implicit neural network. A method to "unravel" the implicit neural network of MPC into an explicit one is also introduced. As well as building links between model-based and data-driven control, these results emphasize the capability of implicit neural networks for representing solutions of optimisation problems, as such problems are themselves implicitly defined functions.
- [1617] arXiv:2404.12041 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Can We Catch the Elephant? The Evolvement of Hallucination Evaluation on Natural Language Generation: A SurveyComments: 19 pages in total, with 9 pages as main body. Under review as a conference paper at CoLM 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Hallucination in Natural Language Generation (NLG) is like the elephant in the room, obvious but often overlooked until recent achievements significantly improved the fluency and grammatical accuracy of generated text. For Large Language Models (LLMs), hallucinations can happen in various downstream tasks and casual conversations, which need accurate assessment to enhance reliability and safety. However, current studies on hallucination evaluation vary greatly, and people still find it difficult to sort out and select the most appropriate evaluation methods. Moreover, as NLP research gradually shifts to the domain of LLMs, it brings new challenges to this direction. This paper provides a comprehensive survey on the evolvement of hallucination evaluation methods, aiming to address three key aspects: 1) Diverse definitions and granularity of facts; 2) The categories of automatic evaluators and their applicability; 3) Unresolved issues and future directions.
- [1618] arXiv:2404.12056 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Deconstructing Human-AI Collaboration: Agency, Interaction, and AdaptationComments: 10 pages, 4 figures. Accepted to Proceedings of EuroVis 2024Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: As full AI-based automation remains out of reach in most real-world applications, the focus has instead shifted to leveraging the strengths of both human and AI agents, creating effective collaborative systems. The rapid advances in this area have yielded increasingly more complex systems and frameworks, while the nuance of their characterization has gotten more vague. Similarly, the existing conceptual models no longer capture the elaborate processes of these systems nor describe the entire scope of their collaboration paradigms. In this paper, we propose a new unified set of dimensions through which to analyze and describe human-AI systems. Our conceptual model is centered around three high-level aspects - agency, interaction, and adaptation - and is developed through a multi-step process. Firstly, an initial design space is proposed by surveying the literature and consolidating existing definitions and conceptual frameworks. Secondly, this model is iteratively refined and validated by conducting semi-structured interviews with nine researchers in this field. Lastly, to illustrate the applicability of our design space, we utilize it to provide a structured description of selected human-AI systems.
- [1619] arXiv:2404.12065 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: RAGAR, Your Falsehood RADAR: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language ModelsComments: 8 pages, submitted to ACL Rolling ReviewSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Abstract: The escalating challenge of misinformation, particularly in the context of political discourse, necessitates advanced solutions for fact-checking. We introduce innovative approaches to enhance the reliability and efficiency of multimodal fact-checking through the integration of Large Language Models (LLMs) with Retrieval-augmented Generation (RAG)- based advanced reasoning techniques. This work proposes two novel methodologies, Chain of RAG (CoRAG) and Tree of RAG (ToRAG). The approaches are designed to handle multimodal claims by reasoning the next questions that need to be answered based on previous evidence. Our approaches improve the accuracy of veracity predictions and the generation of explanations over the traditional fact-checking approach of sub-question generation with chain of thought veracity prediction. By employing multimodal LLMs adept at analyzing both text and images, this research advances the capability of automated systems in identifying and countering misinformation.
- [1620] arXiv:2404.12077 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: TIMIT Speaker Profiling: A Comparison of Multi-task learning and Single-task learning ApproachesSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Abstract: This study employs deep learning techniques to explore four speaker profiling tasks on the TIMIT dataset, namely gender classification, accent classification, age estimation, and speaker identification, highlighting the potential and challenges of multi-task learning versus single-task models. The motivation for this research is twofold: firstly, to empirically assess the advantages and drawbacks of multi-task learning over single-task models in the context of speaker profiling; secondly, to emphasize the undiminished significance of skillful feature engineering for speaker recognition tasks. The findings reveal challenges in accent classification, and multi-task learning is found advantageous for tasks of similar complexity. Non-sequential features are favored for speaker recognition, but sequential ones can serve as starting points for complex models. The study underscores the necessity of meticulous experimentation and parameter tuning for deep learning models.
- [1621] arXiv:2404.12120 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Fortify the Guardian, Not the Treasure: Resilient Adversarial DetectorsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents RADAR-Robust Adversarial Detection via Adversarial Retraining-an approach designed to enhance the robustness of adversarial detectors against adaptive attacks, while maintaining classifier performance. An adaptive attack is one where the attacker is aware of the defenses and adapts their strategy accordingly. Our proposed method leverages adversarial training to reinforce the ability to detect attacks, without compromising clean accuracy. During the training phase, we integrate into the dataset adversarial examples, which were optimized to fool both the classifier and the adversarial detector, enabling the adversarial detector to learn and adapt to potential attack scenarios. Experimental evaluations on the CIFAR-10 and SVHN datasets demonstrate that our proposed algorithm significantly improves a detector's ability to accurately identify adaptive adversarial attacks -- without sacrificing clean accuracy.
- [1622] arXiv:2404.12125 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Intelligence Education made in EuropeComments: 16 pages, 2 figures. No potential conflict of interest was reported by the authorsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: Global conflicts and trouble spots have thrown the world into turmoil. Intelligence services have never been as necessary as they are today when it comes to providing political decision-makers with concrete, accurate, and up-to-date decision-making knowledge. This requires a common co-operation, a common working language and a common understanding of each other. The best way to create this "intelligence community" is through a harmonized intelligence education.
In this paper, we show how joint intelligence education can succeed. We draw on the experience of Germany, where all intelligence services and the Bundeswehr are academically educated together in a single degree program that lays the foundations for a common working language. We also show how these experiences have been successfully transferred to a European level, namely to ICE, the Intelligence College in Europe. Our experience has shown that three aspects are particularly important: firstly, interdisciplinarity or better, transdisciplinarity, secondly, the integration of IT knowhow and thirdly, the development and learning of methodological skills. Using the example of the cyber intelligence module with a special focus on data-driven decision support, additionally with its many points of reference to numerous other academic modules, we show how the specific analytic methodology presented is embedded in our specific European teaching context. - [1623] arXiv:2404.12145 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense ConsistencySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what "understanding" means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes - inspired by Fregean senses - of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model's multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.
- [1624] arXiv:2404.12168 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Real-World Efficient Blind Motion Deblurring via Blur Pixel DiscretizationComments: CVPR2024 Camera-ReadySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: As recent advances in mobile camera technology have enabled the capability to capture high-resolution images, such as 4K images, the demand for an efficient deblurring model handling large motion has increased. In this paper, we discover that the image residual errors, i.e., blur-sharp pixel differences, can be grouped into some categories according to their motion blur type and how complex their neighboring pixels are. Inspired by this, we decompose the deblurring (regression) task into blur pixel discretization (pixel-level blur classification) and discrete-to-continuous conversion (regression with blur class map) tasks. Specifically, we generate the discretized image residual errors by identifying the blur pixels and then transform them to a continuous form, which is computationally more efficient than naively solving the original regression problem with continuous values. Here, we found that the discretization result, i.e., blur segmentation map, remarkably exhibits visual similarity with the image residual errors. As a result, our efficient model shows comparable performance to state-of-the-art methods in realistic benchmarks, while our method is up to 10 times computationally more efficient.
- [1625] arXiv:2404.12172 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: How to Benchmark Vision Foundation Models for Semantic Segmentation?Comments: CVPR 2024 Workshop Proceedings for the Second Workshop on Foundation ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: Recent vision foundation models (VFMs) have demonstrated proficiency in various tasks but require supervised fine-tuning to perform the task of semantic segmentation effectively. Benchmarking their performance is essential for selecting current models and guiding future model developments for this task. The lack of a standardized benchmark complicates comparisons. Therefore, the primary objective of this paper is to study how VFMs should be benchmarked for semantic segmentation. To do so, various VFMs are fine-tuned under various settings, and the impact of individual settings on the performance ranking and training time is assessed. Based on the results, the recommendation is to fine-tune the ViT-B variants of VFMs with a 16x16 patch size and a linear decoder, as these settings are representative of using a larger model, more advanced decoder and smaller patch size, while reducing training time by more than 13 times. Using multiple datasets for training and evaluation is also recommended, as the performance ranking across datasets and domain shifts varies. Linear probing, a common practice for some VFMs, is not recommended, as it is not representative of end-to-end fine-tuning. The benchmarking setup recommended in this paper enables a performance analysis of VFMs for semantic segmentation. The findings of such an analysis reveal that pretraining with promptable segmentation is not beneficial, whereas masked image modeling (MIM) with abstract representations is crucial, even more important than the type of supervision used. The code for efficiently fine-tuning VFMs for semantic segmentation can be accessed through the project page at: this https URL .
- [1626] arXiv:2404.12237 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: De-DSI: Decentralised Differentiable Search IndexComments: Accepted at the 4th Workshop on Machine Learning and Systems (EuroMLSys), EuroSys 2024Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: This study introduces De-DSI, a novel framework that fuses large language models (LLMs) with genuine decentralization for information retrieval, particularly employing the differentiable search index (DSI) concept in a decentralized setting. Focused on efficiently connecting novel user queries with document identifiers without direct document access, De-DSI operates solely on query-docid pairs. To enhance scalability, an ensemble of DSI models is introduced, where the dataset is partitioned into smaller shards for individual model training. This approach not only maintains accuracy by reducing the number of data each model needs to handle but also facilitates scalability by aggregating outcomes from multiple models. This aggregation uses a beam search to identify top docids and applies a softmax function for score normalization, selecting documents with the highest scores for retrieval. The decentralized implementation demonstrates that retrieval success is comparable to centralized methods, with the added benefit of the possibility of distributing computational complexity across the network. This setup also allows for the retrieval of multimedia items through magnet links, eliminating the need for platforms or intermediaries.
- [1627] arXiv:2404.12241 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Introducing v0.5 of the AI Safety Benchmark from MLCommonsBertie Vidgen , Adarsh Agrawal , Ahmed M. Ahmed , Victor Akinwande , Namir Al-Nuaimi , Najla Alfaraj , Elie Alhajjar , Lora Aroyo , Trupti Bavalatti , Borhane Blili-Hamelin , Kurt Bollacker , Rishi Bomassani , Marisa Ferrara Boston , Siméon Campos , Kal Chakra , Canyu Chen , Cody Coleman , Zacharie Delpierre Coudert , Leon Derczynski , Debojyoti Dutta , Ian Eisenberg , James Ezick , Heather Frase , Brian Fuller , Ram Gandikota , Agasthya Gangavarapu , Ananya Gangavarapu , James Gealy , Rajat Ghosh , James Goel , Usman Gohar , Sujata Goswami , Scott A. Hale , Wiebke Hutiri , Joseph Marvin Imperial , Surgan Jandial , Nick Judd , Felix Juefei-Xu , Foutse Khomh , Bhavya Kailkhura , Hannah Rose Kirk , Kevin Klyman , Chris Knotz , Michael Kuchnik , Shachi H. Kumar , Chris Lengerich , Bo Li , Zeyi Liao , Eileen Peters Long , Victor Lu , Yifan Mai , Priyanka Mary Mammen , Kelvin Manyeki , Sean McGregor , Virendra Mehta , Shafee Mohammed , Emanuel Moss , Lama Nachman , Dinesh Jinenhally Naganna , Amin Nikanjam , Besmira Nushi , Luis Oala , Iftach Orr , Alicia Parrish , Cigdem Patlak , William Pietri , Forough Poursabzi-Sangdeh , Eleonora Presani , Fabrizio Puletti , Paul Röttger , Saurav Sahay , Tim Santos , Nino Scherrer , Alice Schoenauer Sebag , Patrick Schramowski , Abolfazl Shahbazi , Vin Sharma , Xudong Shen , Vamsi Sistla , Leonard Tang , Davide Testuggine , Vithursan Thangarasa , Elizabeth Anne Watkins , Rebecca Weiss , Chris Welty , Tyler Wilbers , Adina Williams , Carole-Jean Wu , Poonam Yadav , Xianjun Yang , Yi Zeng , Wenhui Zhang , Fedor Zhdanov , Jiacheng Zhu , Percy Liang , Peter Mattson , Joaquin VanschorenSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
- [1628] arXiv:2404.12256 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: An Online Spatial-Temporal Graph Trajectory Planner for Autonomous VehiclesComments: This is the accepted version and published in the "Early Access" area of IEEE Xplore for the IEEE Transactions on Intelligent Vehicles on 16 April 2024. Article statistics: 11 pages, 9 figures, 2 tablesSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The autonomous driving industry is expected to grow by over 20 times in the coming decade and, thus, motivate researchers to delve into it. The primary focus of their research is to ensure safety, comfort, and efficiency. An autonomous vehicle has several modules responsible for one or more of the aforementioned items. Among these modules, the trajectory planner plays a pivotal role in the safety of the vehicle and the comfort of its passengers. The module is also responsible for respecting kinematic constraints and any applicable road constraints. In this paper, a novel online spatial-temporal graph trajectory planner is introduced to generate safe and comfortable trajectories. First, a spatial-temporal graph is constructed using the autonomous vehicle, its surrounding vehicles, and virtual nodes along the road with respect to the vehicle itself. Next, the graph is forwarded into a sequential network to obtain the desired states. To support the planner, a simple behavioral layer is also presented that determines kinematic constraints for the planner. Furthermore, a novel potential function is also proposed to train the network. Finally, the proposed planner is tested on three different complex driving tasks, and the performance is compared with two frequently used methods. The results show that the proposed planner generates safe and feasible trajectories while achieving similar or longer distances in the forward direction and comparable comfort ride.
- [1629] arXiv:2404.12257 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Food Portion Estimation via 3D Object ScalingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Abstract: Image-based methods to analyze food images have alleviated the user burden and biases associated with traditional methods. However, accurate portion estimation remains a major challenge due to the loss of 3D information in the 2D representation of foods captured by smartphone cameras or wearable devices. In this paper, we propose a new framework to estimate both food volume and energy from 2D images by leveraging the power of 3D food models and physical reference in the eating scene. Our method estimates the pose of the camera and the food object in the input image and recreates the eating occasion by rendering an image of a 3D model of the food with the estimated poses. We also introduce a new dataset, SimpleFood45, which contains 2D images of 45 food items and associated annotations including food volume, weight, and energy. Our method achieves an average error of 31.10 kCal (17.67%) on this dataset, outperforming existing portion estimation methods.
- [1630] arXiv:2404.12259 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooMComments: To appear at CHI 2024Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Data analysts have long sought to turn unstructured text data into meaningful concepts. Though common, topic modeling and clustering focus on lower-level keywords and require significant interpretative work. We introduce concept induction, a computational process that instead produces high-level concepts, defined by explicit inclusion criteria, from unstructured text. For a dataset of toxic online comments, where a state-of-the-art BERTopic model outputs "women, power, female," concept induction produces high-level concepts such as "Criticism of traditional gender roles" and "Dismissal of women's concerns." We present LLooM, a concept induction algorithm that leverages large language models to iteratively synthesize sampled text and propose human-interpretable concepts of increasing generality. We then instantiate LLooM in a mixed-initiative text analysis tool, enabling analysts to shift their attention from interpreting topics to engaging in theory-driven analysis. Through technical evaluations and four analysis scenarios ranging from literature review to content moderation, we find that LLooM's concepts improve upon the prior art of topic models in terms of quality and data coverage. In expert case studies, LLooM helped researchers to uncover new insights even from familiar datasets, for example by suggesting a previously unnoticed concept of attacks on out-party stances in a political social media dataset.
- [1631] arXiv:2404.12267 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Physics-integrated generative modeling using attentive planar normalizing flow based variational autoencoderSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Physics-integrated generative modeling is a class of hybrid or grey-box modeling in which we augment the the data-driven model with the physics knowledge governing the data distribution. The use of physics knowledge allows the generative model to produce output in a controlled way, so that the output, by construction, complies with the physical laws. It imparts improved generalization ability to extrapolate beyond the training distribution as well as improved interpretability because the model is partly grounded in firm domain knowledge. In this work, we aim to improve the fidelity of reconstruction and robustness to noise in the physics integrated generative model. To this end, we use variational-autoencoder as a generative model. To improve the reconstruction results of the decoder, we propose to learn the latent posterior distribution of both the physics as well as the trainable data-driven components using planar normalizng flow. Normalizng flow based posterior distribution harnesses the inherent dynamical structure of the data distribution, hence the learned model gets closer to the true underlying data distribution. To improve the robustness of generative model against noise injected in the model, we propose a modification in the encoder part of the normalizing flow based VAE. We designed the encoder to incorporate scaled dot product attention based contextual information in the noisy latent vector which will mitigate the adverse effect of noise in the latent vector and make the model more robust. We empirically evaluated our models on human locomotion dataset [33] and the results validate the efficacy of our proposed models in terms of improvement in reconstruction quality as well as robustness against noise injected in the model.
- [1632] arXiv:2404.12272 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human PreferencesComments: 16 pages, 4 figures, 2 tablesSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to ``validate the validators'' -- aligning LLM-generated evaluation functions (be it prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment. In particular, we identify a phenomenon we dub \emph{criteria drift}: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appears \emph{dependent} on the specific LLM outputs observed (rather than independent criteria that can be defined \emph{a priori}), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.
- [1633] arXiv:2404.12274 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Advancing the Robustness of Large Language Models through Self-Denoised SmoothingJiabao Ji , Bairu Hou , Zhen Zhang , Guanhua Zhang , Wenqi Fan , Qing Li , Yang Zhang , Gaowen Liu , Sijia Liu , Shiyu ChangComments: Accepted by NAACL 2024. Jiabao, Bairu, Zhen, Guanhua contributed equally. This is an updated version of the paper: arXiv:2307.07171Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Although large language models (LLMs) have achieved significant success, their vulnerability to adversarial perturbations, including recent jailbreak attacks, has raised considerable concerns. However, the increasing size of these models and their limited access make improving their robustness a challenging task. Among various defense strategies, randomized smoothing has shown great potential for LLMs, as it does not require full access to the model's parameters or fine-tuning via adversarial training. However, randomized smoothing involves adding noise to the input before model prediction, and the final model's robustness largely depends on the model's performance on these noise corrupted data. Its effectiveness is often limited by the model's sub-optimal performance on noisy data. To address this issue, we propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions. We call this procedure self-denoised smoothing. Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility. Our experimental results indicate that our method surpasses existing methods in both empirical and certified robustness in defending against adversarial attacks for both downstream tasks and human alignments (i.e., jailbreak attacks). Our code is publicly available at this https URL
- [1634] arXiv:2404.12291 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Augmenting emotion features in irony detection with Large language modelingComments: 11 pages, 3 tables, 2 figures. Accepted by the 25th Chinese Lexical Semantics WorkshopSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This study introduces a novel method for irony detection, applying Large Language Models (LLMs) with prompt-based learning to facilitate emotion-centric text augmentation. Traditional irony detection techniques typically fall short due to their reliance on static linguistic features and predefined knowledge bases, often overlooking the nuanced emotional dimensions integral to irony. In contrast, our methodology augments the detection process by integrating subtle emotional cues, augmented through LLMs, into three benchmark pre-trained NLP models - BERT, T5, and GPT-2 - which are widely recognized as foundational in irony detection. We assessed our method using the SemEval-2018 Task 3 dataset and observed substantial enhancements in irony detection capabilities.
- [1635] arXiv:2404.12299 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language PairComments: 23 pages, 9 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: In Simultaneous Machine Translation (SiMT) systems, training with a simultaneous interpretation (SI) corpus is an effective method for achieving high-quality yet low-latency systems. However, it is very challenging to curate such a corpus due to limitations in the abilities of annotators, and hence, existing SI corpora are limited. Therefore, we propose a method to convert existing speech translation corpora into interpretation-style data, maintaining the original word order and preserving the entire source content using Large Language Models (LLM-SI-Corpus). We demonstrate that fine-tuning SiMT models in text-to-text and speech-to-text settings with the LLM-SI-Corpus reduces latencies while maintaining the same level of quality as the models trained with offline datasets. The LLM-SI-Corpus is available at \url{ this https URL }.
- [1636] arXiv:2404.12317 (cross-list from cs.CE) [ pdf , ps , other ]
-
Title: Large Language Models for Synthetic Participatory Planning of Synergistic Transportation SystemsSubjects: Computational Engineering, Finance, and Science (cs.CE) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Abstract: Unleashing the synergies of rapidly evolving mobility technologies in a multi-stakeholder landscape presents unique challenges and opportunities for addressing urban transportation problems. This paper introduces a novel synthetic participatory method, critically leveraging large language models (LLMs) to create digital avatars representing diverse stakeholders to plan shared automated electric mobility systems (SAEMS). These calibratable agents collaboratively identify objectives, envision and evaluate SAEMS alternatives, and strategize implementation under risks and constraints. The results of a Montreal case study indicate that a structured and parameterized workflow provides outputs with high controllability and comprehensiveness on an SAEMS plan than generated using a single LLM-enabled expert agent. Consequently, the approach provides a promising avenue for cost-efficiently improving the inclusivity and interpretability of multi-objective transportation planning, suggesting a paradigm shift in how we envision and strategize for sustainable and equitable transportation systems.
- [1637] arXiv:2404.12322 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Generalizable Face Landmarking Guided by Conditional Face WarpingComments: Accepted in CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: As a significant step for human face modeling, editing, and generation, face landmarking aims at extracting facial keypoints from images. A generalizable face landmarker is required in practice because real-world facial images, e.g., the avatars in animations and games, are often stylized in various ways. However, achieving generalizable face landmarking is challenging due to the diversity of facial styles and the scarcity of labeled stylized faces. In this study, we propose a simple but effective paradigm to learn a generalizable face landmarker based on labeled real human faces and unlabeled stylized faces. Our method learns the face landmarker as the key module of a conditional face warper. Given a pair of real and stylized facial images, the conditional face warper predicts a warping field from the real face to the stylized one, in which the face landmarker predicts the ending points of the warping field and provides us with high-quality pseudo landmarks for the corresponding stylized facial images. Applying an alternating optimization strategy, we learn the face landmarker to minimize $i)$ the discrepancy between the stylized faces and the warped real ones and $ii)$ the prediction errors of both real and pseudo landmarks. Experiments on various datasets show that our method outperforms existing state-of-the-art domain adaptation methods in face landmarking tasks, leading to a face landmarker with better generalizability. Code is available at this https URL .
- [1638] arXiv:2404.12353 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction TuningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their limited amount of source videos, which hampers the effective fine-tuning of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39\%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.
- [1639] arXiv:2404.12365 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many ClassesComments: Accepted to NAACLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: We present FastFit, a method, and a Python package design to provide fast and accurate few-shot classification, especially for scenarios with many semantically similar classes. FastFit utilizes a novel approach integrating batch contrastive learning and token-level similarity score. Compared to existing few-shot learning packages, such as SetFit, Transformers, or few-shot prompting of large language models via API calls, FastFit significantly improves multiclass classification performance in speed and accuracy across FewMany, our newly curated English benchmark, and Multilingual datasets. FastFit demonstrates a 3-20x improvement in training speed, completing training in just a few seconds. The FastFit package is now available on GitHub and PyPi, presenting a user-friendly solution for NLP practitioners.
- [1640] arXiv:2404.12378 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: 6Img-to-3D: Few-Image Large-Scale Outdoor Driving Scene ReconstructionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Current 3D reconstruction techniques struggle to infer unbounded scenes from a few images faithfully. Specifically, existing methods have high computational demands, require detailed pose information, and cannot reconstruct occluded regions reliably. We introduce 6Img-to-3D, an efficient, scalable transformer-based encoder-renderer method for single-shot image to 3D reconstruction. Our method outputs a 3D-consistent parameterized triplane from only six outward-facing input images for large-scale, unbounded outdoor driving scenarios. We take a step towards resolving existing shortcomings by combining contracted custom cross- and self-attention mechanisms for triplane parameterization, differentiable volume rendering, scene contraction, and image feature projection. We showcase that six surround-view vehicle images from a single timestamp without global pose information are enough to reconstruct 360$^{\circ}$ scenes during inference time, taking 395 ms. Our method allows, for example, rendering third-person images and birds-eye views. Our code is available at this https URL , and more examples can be found at our website here this https URL .
- [1641] arXiv:2404.12382 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Lazy Diffusion Transformer for Interactive Image EditingYotam Nitzan , Zongze Wu , Richard Zhang , Eli Shechtman , Daniel Cohen-Or , Taesung Park , Michaël GharbiSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR)
Abstract: We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image.
- [1642] arXiv:2404.12390 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: BLINK: Multimodal Large Language Models Can See but Not PerceiveXingyu Fu , Yushi Hu , Bangzheng Li , Yu Feng , Haoyu Wang , Xudong Lin , Dan Roth , Noah A. Smith , Wei-Chiu Ma , Ranjay KrishnaComments: Multimodal Benchmark, Project Url: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
- [1643] arXiv:2404.12399 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Model Failure or Data Corruption? Exploring Inconsistencies in Building Energy Ratings with Self-Supervised Contrastive LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Building Energy Rating (BER) stands as a pivotal metric, enabling building owners, policymakers, and urban planners to understand the energy-saving potential through improving building energy efficiency. As such, enhancing buildings' BER levels is expected to directly contribute to the reduction of carbon emissions and promote climate improvement. Nonetheless, the BER assessment process is vulnerable to missing and inaccurate measurements. In this study, we introduce \texttt{CLEAR}, a data-driven approach designed to scrutinize the inconsistencies in BER assessments through self-supervised contrastive learning. We validated the effectiveness of \texttt{CLEAR} using a dataset representing Irish building stocks. Our experiments uncovered evidence of inconsistent BER assessments, highlighting measurement data corruption within this real-world dataset.
- [1644] arXiv:2404.12401 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Items or Relations -- what do Artificial Neural Networks learn?Comments: 9 pages, 2 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Neural and Evolutionary Computing (cs.NE)
Abstract: What has an Artificial Neural Network (ANN) learned after being successfully trained to solve a task - the set of training items or the relations between them? This question is difficult to answer for modern applied ANNs because of their enormous size and complexity. Therefore, here we consider a low-dimensional network and a simple task, i.e., the network has to reproduce a set of training items identically. We construct the family of solutions analytically and use standard learning algorithms to obtain numerical solutions. These numerical solutions differ depending on the optimization algorithm and the weight initialization and are shown to be particular members of the family of analytical solutions. In this simple setting, we observe that the general structure of the network weights represents the training set's symmetry group, i.e., the relations between training items. As a consequence, linear networks generalize, i.e., reproduce items that were not part of the training set but are consistent with the symmetry of the training set. In contrast, non-linear networks tend to learn individual training items and show associative memory. At the same time, their ability to generalize is limited. A higher degree of generalization is obtained for networks whose activation function contains a linear regime, such as tanh. Our results suggest ANN's ability to generalize - instead of learning items - could be improved by generating a sufficiently big set of elementary operations to represent relations and strongly depends on the applied non-linearity.
- [1645] arXiv:2404.12402 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Sup3r: A Semi-Supervised Algorithm for increasing Sparsity, Stability, and Separability in Hierarchy Of Time-Surfaces architecturesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: The Hierarchy Of Time-Surfaces (HOTS) algorithm, a neuromorphic approach for feature extraction from event data, presents promising capabilities but faces challenges in accuracy and compatibility with neuromorphic hardware. In this paper, we introduce Sup3r, a Semi-Supervised algorithm aimed at addressing these challenges. Sup3r enhances sparsity, stability, and separability in the HOTS networks. It enables end-to-end online training of HOTS networks replacing external classifiers, by leveraging semi-supervised learning. Sup3r learns class-informative patterns, mitigates confounding features, and reduces the number of processed events. Moreover, Sup3r facilitates continual and incremental learning, allowing adaptation to data distribution shifts and learning new tasks without forgetting. Preliminary results on N-MNIST demonstrate that Sup3r achieves comparable accuracy to similarly sized Artificial Neural Networks trained with back-propagation. This work showcases the potential of Sup3r to advance the capabilities of HOTS networks, offering a promising avenue for neuromorphic algorithms in real-world applications.
- [1646] arXiv:2404.12403 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Multi-Objective Hardware Aware Neural Architecture Search using Hardware Cost DiversityComments: Accepted at the CVPR 2024 Workshop, called "Efficient Deep Learning for Computer Vision (ECV)"Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Hardware-aware Neural Architecture Search approaches (HW-NAS) automate the design of deep learning architectures, tailored specifically to a given target hardware platform. Yet, these techniques demand substantial computational resources, primarily due to the expensive process of assessing the performance of identified architectures. To alleviate this problem, a recent direction in the literature has employed representation similarity metric for efficiently evaluating architecture performance. Nonetheless, since it is inherently a single objective method, it requires multiple runs to identify the optimal architecture set satisfying the diverse hardware cost constraints, thereby increasing the search cost. Furthermore, simply converting the single objective into a multi-objective approach results in an under-explored architectural search space. In this study, we propose a Multi-Objective method to address the HW-NAS problem, called MO-HDNAS, to identify the trade-off set of architectures in a single run with low computational cost. This is achieved by optimizing three objectives: maximizing the representation similarity metric, minimizing hardware cost, and maximizing the hardware cost diversity. The third objective, i.e. hardware cost diversity, is used to facilitate a better exploration of the architecture search space. Experimental results demonstrate the effectiveness of our proposed method in efficiently addressing the HW-NAS problem across six edge devices for the image classification task.
- [1647] arXiv:2404.12404 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Group-wise Prompting for Synthetic Tabular Data Generation using Large Language ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Generating realistic synthetic tabular data presents a critical challenge in machine learning. This study introduces a simple yet effective method employing Large Language Models (LLMs) tailored to generate synthetic data, specifically addressing data imbalance problems. We propose a novel group-wise prompting method in CSV-style formatting that leverages the in-context learning capabilities of LLMs to produce data that closely adheres to the specified requirements and characteristics of the target dataset. Moreover, our proposed random word replacement strategy significantly improves the handling of monotonous categorical values, enhancing the accuracy and representativeness of the synthetic data. The effectiveness of our method is extensively validated across eight real-world public datasets, achieving state-of-the-art performance in downstream classification and regression tasks while maintaining inter-feature correlations and improving token efficiency over existing approaches. This advancement significantly contributes to addressing the key challenges of machine learning applications, particularly in the context of tabular data generation and handling class imbalance. The source code for our work is available at: this https URL
- [1648] arXiv:2404.12444 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: mOthello: When Do Cross-Lingual Representation Alignment and Cross-Lingual Transfer Emerge in Multilingual Models?Comments: Accepted at Findings of NAACL 2024. Project Webpage: this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Many pretrained multilingual models exhibit cross-lingual transfer ability, which is often attributed to a learned language-neutral representation during pretraining. However, it remains unclear what factors contribute to the learning of a language-neutral representation, and whether the learned language-neutral representation suffices to facilitate cross-lingual transfer. We propose a synthetic task, Multilingual Othello (mOthello), as a testbed to delve into these two questions. We find that: (1) models trained with naive multilingual pretraining fail to learn a language-neutral representation across all input languages; (2) the introduction of "anchor tokens" (i.e., lexical items that are identical across languages) helps cross-lingual representation alignment; and (3) the learning of a language-neutral representation alone is not sufficient to facilitate cross-lingual transfer. Based on our findings, we propose a novel approach - multilingual pretraining with unified output space - that both induces the learning of language-neutral representation and facilitates cross-lingual transfer.
- [1649] arXiv:2404.12450 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Enhancing AI Diagnostics: Autonomous Lesion Masking via Semi-Supervised Deep LearningTing-Ruen Wei , Michele Hell , Dang Bich Thuy Le , Aren Vierra , Ran Pang , Mahesh Patel , Young Kang , Yuling YanSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This study presents an unsupervised domain adaptation method aimed at autonomously generating image masks outlining regions of interest (ROIs) for differentiating breast lesions in breast ultrasound (US) imaging. Our semi-supervised learning approach utilizes a primitive model trained on a small public breast US dataset with true annotations. This model is then iteratively refined for the domain adaptation task, generating pseudo-masks for our private, unannotated breast US dataset. The dataset, twice the size of the public one, exhibits considerable variability in image acquisition perspectives and demographic representation, posing a domain-shift challenge. Unlike typical domain adversarial training, we employ downstream classification outcomes as a benchmark to guide the updating of pseudo-masks in subsequent iterations. We found the classification precision to be highly correlated with the completeness of the generated ROIs, which promotes the explainability of the deep learning classification model. Preliminary findings demonstrate the efficacy and reliability of this approach in streamlining the ROI annotation process, thereby enhancing the classification and localization of breast lesions for more precise and interpretable diagnoses.
- [1650] arXiv:2404.12474 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Learning a Stable, Safe, Distributed Feedback Controller for a Heterogeneous Platoon of VehiclesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO); Systems and Control (eess.SY)
Abstract: Platooning of autonomous vehicles has the potential to increase safety and fuel efficiency on highways. The goal of platooning is to have each vehicle drive at some speed (set by the leader) while maintaining a safe distance from its neighbors. Many prior works have analyzed various controllers for platooning, most commonly linear feedback and distributed model predictive controllers. In this work, we introduce an algorithm for learning a stable, safe, distributed controller for a heterogeneous platoon. Our algorithm relies on recent developments in learning neural network stability and safety certificates. We train a controller for autonomous platooning in simulation and evaluate its performance on hardware with a platoon of four F1Tenth vehicles. We then perform further analysis in simulation with a platoon of 100 vehicles. Experimental results demonstrate the practicality of the algorithm and the learned controller by comparing the performance of the neural network controller to linear feedback and distributed model predictive controllers.
- [1651] arXiv:2404.12485 (cross-list from cs.DS) [ pdf , ps , html , other ]
-
Title: Contract Scheduling with Distributional and Multiple AdviceComments: To appear in Proceedings of IJCAI 2024Subjects: Data Structures and Algorithms (cs.DS) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Contract scheduling is a widely studied framework for designing real-time systems with interruptible capabilities. Previous work has showed that a prediction on the interruption time can help improve the performance of contract-based systems, however it has relied on a single prediction that is provided by a deterministic oracle. In this work, we introduce and study more general and realistic learning-augmented settings in which the prediction is in the form of a probability distribution, or it is given as a set of multiple possible interruption times. For both prediction settings, we design and analyze schedules which perform optimally if the prediction is accurate, while simultaneously guaranteeing the best worst-case performance if the prediction is adversarial. We also provide evidence that the resulting system is robust to prediction errors in the distributional setting. Last, we present an experimental evaluation that confirms the theoretical findings, and illustrates the performance improvements that can be attained in practice.
- [1652] arXiv:2404.12486 (cross-list from cs.DC) [ pdf , ps , html , other ]
-
Title: Follow-Me AI: Energy-Efficient User Interaction with Smart EnvironmentsSubjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Abstract: This article introduces Follow-Me AI, a concept designed to enhance user interactions with smart environments, optimize energy use, and provide better control over data captured by these environments. Through AI agents that accompany users, Follow-Me AI negotiates data management based on user consent, aligns environmental controls as well as user communication and computes resources available in the environment with user preferences, and predicts user behavior to proactively adjust the smart environment. The manuscript illustrates this concept with a detailed example of Follow-Me AI in a smart campus setting, detailing the interactions with the building's management system for optimal comfort and efficiency. Finally, this article looks into the challenges and opportunities related to Follow-Me AI.
- [1653] arXiv:2404.12488 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Global Counterfactual DirectionsComments: PreprintSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Despite increasing progress in development of methods for generating visual counterfactual explanations, especially with the recent rise of Denoising Diffusion Probabilistic Models, previous works consider them as an entirely local technique. In this work, we take the first step at globalizing them. Specifically, we discover that the latent space of Diffusion Autoencoders encodes the inference process of a given classifier in the form of global directions. We propose a novel proxy-based approach that discovers two types of these directions with the use of only single image in an entirely black-box manner. Precisely, g-directions allow for flipping the decision of a given classifier on an entire dataset of images, while h-directions further increase the diversity of explanations. We refer to them in general as Global Counterfactual Directions (GCDs). Moreover, we show that GCDs can be naturally combined with Latent Integrated Gradients resulting in a new black-box attribution method, while simultaneously enhancing the understanding of counterfactual explanations. We validate our approach on existing benchmarks and show that it generalizes to real-world use-cases.
- [1654] arXiv:2404.12491 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: GraphER: A Structure-aware Text-to-Graph Model for Entity and Relation ExtractionComments: Work in progressSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Information extraction (IE) is an important task in Natural Language Processing (NLP), involving the extraction of named entities and their relationships from unstructured text. In this paper, we propose a novel approach to this task by formulating it as graph structure learning (GSL). By formulating IE as GSL, we enhance the model's ability to dynamically refine and optimize the graph structure during the extraction process. This formulation allows for better interaction and structure-informed decisions for entity and relation prediction, in contrast to previous models that have separate or untied predictions for these tasks. When compared against state-of-the-art baselines on joint entity and relation extraction benchmarks, our model, GraphER, achieves competitive results.
- [1655] arXiv:2404.12493 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: EnriCo: Enriched Representation and Globally Constrained Inference for Entity and Relation ExtractionComments: Work in progressSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Joint entity and relation extraction plays a pivotal role in various applications, notably in the construction of knowledge graphs. Despite recent progress, existing approaches often fall short in two key aspects: richness of representation and coherence in output structure. These models often rely on handcrafted heuristics for computing entity and relation representations, potentially leading to loss of crucial information. Furthermore, they disregard task and/or dataset-specific constraints, resulting in output structures that lack coherence. In our work, we introduce EnriCo, which mitigates these shortcomings. Firstly, to foster rich and expressive representation, our model leverage attention mechanisms that allow both entities and relations to dynamically determine the pertinent information required for accurate extraction. Secondly, we introduce a series of decoding algorithms designed to infer the highest scoring solutions while adhering to task and dataset-specific constraints, thus promoting structured and coherent outputs. Our model demonstrates competitive performance compared to baselines when evaluated on Joint IE datasets.
- [1656] arXiv:2404.12498 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A Configurable Pythonic Data Center Model for Sustainable Cooling and ML IntegrationAvisek Naug , Antonio Guillen , Ricardo Luna Gutierrez , Vineet Gundecha , Sahand Ghorbanpour , Sajad Mousavi , Ashwin Ramesh Babu , Soumyendu SarkarComments: NeurIPS 2023 Workshop on Tackling Climate Change with Machine Learning this https URL . arXiv admin note: substantial text overlap with arXiv:2310.03906Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: There have been growing discussions on estimating and subsequently reducing the operational carbon footprint of enterprise data centers. The design and intelligent control for data centers have an important impact on data center carbon footprint. In this paper, we showcase PyDCM, a Python library that enables extremely fast prototyping of data center design and applies reinforcement learning-enabled control with the purpose of evaluating key sustainability metrics including carbon footprint, energy consumption, and observing temperature hotspots. We demonstrate these capabilities of PyDCM and compare them to existing works in EnergyPlus for modeling data centers. PyDCM can also be used as a standalone Gymnasium environment for demonstrating sustainability-focused data center control.
- [1657] arXiv:2404.12509 (cross-list from cs.GR) [ pdf , ps , html , other ]
-
Title: Compositional Neural TexturesSubjects: Graphics (cs.GR) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Texture plays a vital role in enhancing visual richness in both real photographs and computer-generated imagery. However, the process of editing textures often involves laborious and repetitive manual adjustments of textons, which are the small, recurring local patterns that define textures. In this work, we introduce a fully unsupervised approach for representing textures using a compositional neural model that captures individual textons. We represent each texton as a 2D Gaussian function whose spatial support approximates its shape, and an associated feature that encodes its detailed appearance. By modeling a texture as a discrete composition of Gaussian textons, the representation offers both expressiveness and ease of editing. Textures can be edited by modifying the compositional Gaussians within the latent space, and new textures can be efficiently synthesized by feeding the modified Gaussians through a generator network in a feed-forward manner. This approach enables a wide range of applications, including transferring appearance from an image texture to another image, diversifying textures, texture interpolation, revealing/modifying texture variations, edit propagation, texture animation, and direct texton manipulation. The proposed approach contributes to advancing texture analysis, modeling, and editing techniques, and opens up new possibilities for creating visually appealing images with controllable textures.
- [1658] arXiv:2404.12522 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Neural Active Learning Beyond BanditsComments: Published on ICLR 2024, 40 PagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We study both stream-based and pool-based active learning with neural network approximations. A recent line of works proposed bandit-based approaches that transformed active learning into a bandit problem, achieving both theoretical and empirical success. However, the performance and computational costs of these methods may be susceptible to the number of classes, denoted as $K$, due to this transformation. Therefore, this paper seeks to answer the question: "How can we mitigate the adverse impacts of $K$ while retaining the advantages of principled exploration and provable performance guarantees in active learning?" To tackle this challenge, we propose two algorithms based on the newly designed exploitation and exploration neural networks for stream-based and pool-based active learning. Subsequently, we provide theoretical performance guarantees for both algorithms in a non-parametric setting, demonstrating a slower error-growth rate concerning $K$ for the proposed approaches. We use extensive experiments to evaluate the proposed algorithms, which consistently outperform state-of-the-art baselines.
- [1659] arXiv:2404.12535 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: HalluciBot: Is There No Such Thing as a Bad Question?Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Hallucination continues to be one of the most critical challenges in the institutional adoption journey of Large Language Models (LLMs). In this context, an overwhelming number of studies have focused on analyzing the post-generation phase - refining outputs via feedback, analyzing logit output values, or deriving clues via the outputs' artifacts. We propose HalluciBot, a model that predicts the probability of hallucination $\textbf{before generation}$, for any query imposed to an LLM. In essence, HalluciBot does not invoke any generation during inference. To derive empirical evidence for HalluciBot, we employ a Multi-Agent Monte Carlo Simulation using a Query Perturbator to craft $n$ variations per query at train time. The construction of our Query Perturbator is motivated by our introduction of a new definition of hallucination - $\textit{truthful hallucination}$. Our training methodology generated 2,219,022 estimates for a training corpus of 369,837 queries, spanning 13 diverse datasets and 3 question-answering scenarios. HalluciBot predicts both binary and multi-class probabilities of hallucination, enabling a means to judge the query's quality with regards to its propensity to hallucinate. Therefore, HalluciBot paves the way to revise or cancel a query before generation and the ensuing computational waste. Moreover, it provides a lucid means to measure user accountability for hallucinatory queries.
- [1660] arXiv:2404.12569 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Multi-View Subgraph Neural Networks: Self-Supervised Learning with Scarce Labeled DataSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: While graph neural networks (GNNs) have become the de-facto standard for graph-based node classification, they impose a strong assumption on the availability of sufficient labeled samples. This assumption restricts the classification performance of prevailing GNNs on many real-world applications suffering from low-data regimes. Specifically, features extracted from scarce labeled nodes could not provide sufficient supervision for the unlabeled samples, leading to severe over-fitting. In this work, we point out that leveraging subgraphs to capture long-range dependencies can augment the representation of a node with homophily properties, thus alleviating the low-data regime. However, prior works leveraging subgraphs fail to capture the long-range dependencies among nodes. To this end, we present a novel self-supervised learning framework, called multi-view subgraph neural networks (Muse), for handling long-range dependencies. In particular, we propose an information theory-based identification mechanism to identify two types of subgraphs from the views of input space and latent space, respectively. The former is to capture the local structure of the graph, while the latter captures the long-range dependencies among nodes. By fusing these two views of subgraphs, the learned representations can preserve the topological properties of the graph at large, including the local structure and long-range dependencies, thus maximizing their expressiveness for downstream node classification tasks. Experimental results show that Muse outperforms the alternative methods on node classification tasks with limited labeled data.
- [1661] arXiv:2404.12580 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: iTBLS: A Dataset of Interactive Conversations Over Tabular InformationComments: 14 pages, 4 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper introduces Interactive Tables (iTBLS), a dataset of interactive conversations situated in tables from scientific articles. This dataset is designed to facilitate human-AI collaborative problem-solving through AI-powered multi-task tabular capabilities. In contrast to prior work that models interactions as factoid QA or procedure synthesis, iTBLS broadens the scope of interactions to include mathematical reasoning, natural language manipulation, and expansion of existing tables from natural language conversation by delineating interactions into one of three tasks: interpretation, modification, or generation. Additionally, the paper presents a suite of baseline approaches to iTBLS, utilizing zero-shot prompting and parameter-efficient fine-tuning for different computing situations. We also introduce a novel multi-step approach and show how it can be leveraged in conjunction with parameter-efficient fine-tuning to achieve the state-of-the-art on iTBLS; outperforming standard parameter-efficient fine-tuning by up to 15% on interpretation, 18% on modification, and 38% on generation.
- [1662] arXiv:2404.12594 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Random Network Distillation Based Deep Reinforcement Learning for AGV Path PlanningComments: 6 pages, 8 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: With the flourishing development of intelligent warehousing systems, the technology of Automated Guided Vehicle (AGV) has experienced rapid growth. Within intelligent warehousing environments, AGV is required to safely and rapidly plan an optimal path in complex and dynamic environments. Most research has studied deep reinforcement learning to address this challenge. However, in the environments with sparse extrinsic rewards, these algorithms often converge slowly, learn inefficiently or fail to reach the target. Random Network Distillation (RND), as an exploration enhancement, can effectively improve the performance of proximal policy optimization, especially enhancing the additional intrinsic rewards of the AGV agent which is in sparse reward environments. Moreover, most of the current research continues to use 2D grid mazes as experimental environments. These environments have insufficient complexity and limited action sets. To solve this limitation, we present simulation environments of AGV path planning with continuous actions and positions for AGVs, so that it can be close to realistic physical scenarios. Based on our experiments and comprehensive analysis of the proposed method, the results demonstrate that our proposed method enables AGV to more rapidly complete path planning tasks with continuous actions in our environments. A video of part of our experiments can be found at this https URL .
- [1663] arXiv:2404.12596 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Parameter Efficient Diverse Paraphrase Generation Using Sequence-Level Knowledge DistillationComments: Published in: 2024 5th International Conference on Advancements in Computational Sciences (ICACS) with IEEESubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Over the past year, the field of Natural Language Generation (NLG) has experienced an exponential surge, largely due to the introduction of Large Language Models (LLMs). These models have exhibited the most effective performance in a range of domains within the Natural Language Processing and Generation domains. However, their application in domain-specific tasks, such as paraphrasing, presents significant challenges. The extensive number of parameters makes them difficult to operate on commercial hardware, and they require substantial time for inference, leading to high costs in a production setting. In this study, we tackle these obstacles by employing LLMs to develop three distinct models for the paraphrasing field, applying a method referred to as sequence-level knowledge distillation. These distilled models are capable of maintaining the quality of paraphrases generated by the LLM. They demonstrate faster inference times and the ability to generate diverse paraphrases of comparable quality. A notable characteristic of these models is their ability to exhibit syntactic diversity while also preserving lexical diversity, features previously uncommon due to existing data quality issues in datasets and not typically observed in neural-based approaches. Human evaluation of our models shows that there is only a 4% drop in performance compared to the LLM teacher model used in the distillation process, despite being 1000 times smaller. This research provides a significant contribution to the NLG field, offering a more efficient and cost-effective solution for paraphrasing tasks.
- [1664] arXiv:2404.12618 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual ScriptsComments: Accepted at LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Naively assuming English as a source language may hinder cross-lingual transfer for many languages by failing to consider the importance of language contact. Some languages are more well-connected than others, and target languages can benefit from transferring from closely related languages; for many languages, the set of closely related languages does not include English. In this work, we study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language. We also construct a novel benchmark dataset for close contact Chinese-Japanese-Korean-Vietnamese (CJKV) languages to further encourage in-depth studies of language contact. To comprehensively capture contact between these languages, we propose to integrate Romanized transcription beyond textual scripts via Contrastive Learning objectives, leading to enhanced cross-lingual representations and effective zero-shot cross-lingual transfer.
- [1665] arXiv:2404.12627 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: A Soft e-Textile Sensor for Enhanced Deep Learning-based Shape Sensing of Soft Continuum RobotsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: The safety and accuracy of robotic navigation hold paramount importance, especially in the realm of soft continuum robotics, where the limitations of traditional rigid sensors become evident. Encoders, piezoresistive, and potentiometer sensors often fail to integrate well with the flexible nature of these robots, adding unwanted bulk and rigidity. To overcome these hurdles, our study presents a new approach to shape sensing in soft continuum robots through the use of soft e-textile resistive sensors. This sensor, designed to flawlessly integrate with the robot's structure, utilizes a resistive material that adjusts its resistance in response to the robot's movements and deformations. This adjustment facilitates the capture of multidimensional force measurements across the soft sensor layers. A deep Convolutional Neural Network (CNN) is employed to decode the sensor signals, enabling precise estimation of the robot's shape configuration based on the detailed data from the e-textile sensor. Our research investigates the efficacy of this e-textile sensor in determining the curvature parameters of soft continuum robots. The findings are encouraging, showing that the soft e-textile sensor not only matches but potentially exceeds the capabilities of traditional rigid sensors in terms of shape sensing and estimation. This advancement significantly boosts the safety and efficiency of robotic navigation systems.
- [1666] arXiv:2404.12631 (cross-list from cs.NE) [ pdf , ps , other ]
-
Title: Breaching the Bottleneck: Evolutionary Transition from Reward-Driven Learning to Reward-Agnostic Domain-Adapted Learning in Neuromodulated Neural NetsComments: 9 pages, 5 figuresSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Advanced biological intelligence learns efficiently from an information-rich stream of stimulus information, even when feedback on behaviour quality is sparse or absent. Such learning exploits implicit assumptions about task domains. We refer to such learning as Domain-Adapted Learning (DAL). In contrast, AI learning algorithms rely on explicit externally provided measures of behaviour quality to acquire fit behaviour. This imposes an information bottleneck that precludes learning from diverse non-reward stimulus information, limiting learning efficiency. We consider the question of how biological evolution circumvents this bottleneck to produce DAL. We propose that species first evolve the ability to learn from reward signals, providing inefficient (bottlenecked) but broad adaptivity. From there, integration of non-reward information into the learning process can proceed via gradual accumulation of biases induced by such information on specific task domains. This scenario provides a biologically plausible pathway towards bottleneck-free, domain-adapted learning. Focusing on the second phase of this scenario, we set up a population of NNs with reward-driven learning modelled as Reinforcement Learning (A2C), and allow evolution to improve learning efficiency by integrating non-reward information into the learning process using a neuromodulatory update mechanism. On a navigation task in continuous 2D space, evolved DAL agents show a 300-fold increase in learning speed compared to pure RL agents. Evolution is found to eliminate reliance on reward information altogether, allowing DAL agents to learn from non-reward information exclusively, using local neuromodulation-based connection weight updates only.
- [1667] arXiv:2404.12634 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Transformer-Based Classification Outcome Prediction for Multimodal Stroke TreatmentSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This study proposes a multi-modal fusion framework Multitrans based on the Transformer architecture and self-attention mechanism. This architecture combines the study of non-contrast computed tomography (NCCT) images and discharge diagnosis reports of patients undergoing stroke treatment, using a variety of methods based on Transformer architecture approach to predicting functional outcomes of stroke treatment. The results show that the performance of single-modal text classification is significantly better than single-modal image classification, but the effect of multi-modal combination is better than any single modality. Although the Transformer model only performs worse on imaging data, when combined with clinical meta-diagnostic information, both can learn better complementary information and make good contributions to accurately predicting stroke treatment effects..
- [1668] arXiv:2404.12652 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Pre-trained Vision-Language Models Learn Discoverable Visual ConceptsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Do vision-language models (VLMs) pre-trained to caption an image of a "durian" learn visual concepts such as "brown" (color) and "spiky" (texture) at the same time? We aim to answer this question as visual concepts learned "for free" would enable wide applications such as neuro-symbolic reasoning or human-interpretable object classification. We assume that the visual concepts, if captured by pre-trained VLMs, can be extracted by their vision-language interface with text-based concept prompts. We observe that recent works prompting VLMs with concepts often differ in their strategies to define and evaluate the visual concepts, leading to conflicting conclusions. We propose a new concept definition strategy based on two observations: First, certain concept prompts include shortcuts that recognize correct concepts for wrong reasons; Second, multimodal information (e.g. visual discriminativeness, and textual knowledge) should be leveraged when selecting the concepts. Our proposed concept discovery and learning (CDL) framework is thus designed to identify a diverse list of generic visual concepts (e.g. "spiky" as opposed to "spiky durian"), which are ranked and selected based on visual and language mutual information. We carefully design quantitative and human evaluations of the discovered concepts on six diverse visual recognition datasets, which confirm that pre-trained VLMs do learn visual concepts that provide accurate and thorough descriptions for the recognized objects. All code and models are publicly released.
- [1669] arXiv:2404.12667 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Detecting Out-Of-Distribution Earth Observation Images with Diffusion ModelsGeorges Le Bellier (CEDRIC - VERTIGO, CNAM), Nicolas Audebert (CEDRIC - VERTIGO, CNAM, IGN)Comments: EARTHVISION 2024 IEEE/CVF CVPR Workshop. Large Scale Computer Vision for Remote Sensing Imagery, Jun 2024, Seattle, United StatesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Earth Observation imagery can capture rare and unusual events, such as disasters and major landscape changes, whose visual appearance contrasts with the usual observations. Deep models trained on common remote sensing data will output drastically different features for these out-of-distribution samples, compared to those closer to their training dataset. Detecting them could therefore help anticipate changes in the observations, either geographical or environmental. In this work, we show that the reconstruction error of diffusion models can effectively serve as unsupervised out-of-distribution detectors for remote sensing images, using them as a plausibility score. Moreover, we introduce ODEED, a novel reconstruction-based scorer using the probability-flow ODE of diffusion models. We validate it experimentally on SpaceNet 8 with various scenarios, such as classical OOD detection with geographical shift and near-OOD setups: pre/post-flood and non-flooded/flooded image recognition. We show that our ODEED scorer significantly outperforms other diffusion-based and discriminative baselines on the more challenging near-OOD scenarios of flood image detection, where OOD images are close to the distribution tail. We aim to pave the way towards better use of generative models for anomaly detection in remote sensing.
- [1670] arXiv:2404.12689 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: Can LLMs Understand Computer Networks? Towards a Virtual System AdministratorSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Abstract: Recent advancements in Artificial Intelligence, and particularly Large Language Models (LLMs), offer promising prospects for aiding system administrators in managing the complexity of modern networks. However, despite this potential, a significant gap exists in the literature regarding the extent to which LLMs can understand computer networks. Without empirical evidence, system administrators might rely on these models without assurance of their efficacy in performing network-related tasks accurately.
In this paper, we are the first to conduct an exhaustive study on LLMs' comprehension of computer networks. We formulate several research questions to determine whether LLMs can provide correct answers when supplied with a network topology and questions on it. To assess them, we developed a thorough framework for evaluating LLMs' capabilities in various network-related tasks. We evaluate our framework on multiple computer networks employing private (e.g., GPT4) and open-source (e.g., Llama2) models. Our findings demonstrate promising results, with the best model achieving an average accuracy of 79.3%. Private LLMs achieve noteworthy results in small and medium networks, while challenges persist in comprehending complex network topologies, particularly for open-source models. Moreover, we provide insight into how prompt engineering can enhance the accuracy of some tasks. - [1671] arXiv:2404.12711 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Dynamic Temperature Knowledge DistillationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Temperature plays a pivotal role in moderating label softness in the realm of knowledge distillation (KD). Traditional approaches often employ a static temperature throughout the KD process, which fails to address the nuanced complexities of samples with varying levels of difficulty and overlooks the distinct capabilities of different teacher-student pairings. This leads to a less-than-ideal transfer of knowledge. To improve the process of knowledge propagation, we proposed Dynamic Temperature Knowledge Distillation (DTKD) which introduces a dynamic, cooperative temperature control for both teacher and student models simultaneously within each training iterafion. In particular, we proposed "\textbf{sharpness}" as a metric to quantify the smoothness of a model's output distribution. By minimizing the sharpness difference between the teacher and the student, we can derive sample-specific temperatures for them respectively. Extensive experiments on CIFAR-100 and ImageNet-2012 demonstrate that DTKD performs comparably to leading KD techniques, with added robustness in Target Class KD and None-target Class KD scenarios.The code is available at this https URL .
- [1672] arXiv:2404.12712 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: uTRAND: Unsupervised Anomaly Detection in Traffic TrajectoriesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Deep learning-based approaches have achieved significant improvements on public video anomaly datasets, but often do not perform well in real-world applications. This paper addresses two issues: the lack of labeled data and the difficulty of explaining the predictions of a neural network. To this end, we present a framework called uTRAND, that shifts the problem of anomalous trajectory prediction from the pixel space to a semantic-topological domain. The framework detects and tracks all types of traffic agents in bird's-eye-view videos of traffic cameras mounted at an intersection. By conceptualizing the intersection as a patch-based graph, it is shown that the framework learns and models the normal behaviour of traffic agents without costly manual labeling. Furthermore, uTRAND allows to formulate simple rules to classify anomalous trajectories in a way suited for human interpretation. We show that uTRAND outperforms other state-of-the-art approaches on a dataset of anomalous trajectories collected in a real-world setting, while producing explainable detection results.
- [1673] arXiv:2404.12717 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Show and Grasp: Few-shot Semantic Segmentation for Robot Grasping through Zero-shot Foundation ModelsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: The ability of a robot to pick an object, known as robot grasping, is crucial for several applications, such as assembly or sorting. In such tasks, selecting the right target to pick is as essential as inferring a correct configuration of the gripper. A common solution to this problem relies on semantic segmentation models, which often show poor generalization to unseen objects and require considerable time and massive data to be trained. To reduce the need for large datasets, some grasping pipelines exploit few-shot semantic segmentation models, which are capable of recognizing new classes given a few examples. However, this often comes at the cost of limited performance and fine-tuning is required to be effective in robot grasping scenarios. In this work, we propose to overcome all these limitations by combining the impressive generalization capability reached by foundation models with a high-performing few-shot classifier, working as a score function to select the segmentation that is closer to the support set. The proposed model is designed to be embedded in a grasp synthesis pipeline. The extensive experiments using one or five examples show that our novel approach overcomes existing performance limitations, improving the state of the art both in few-shot semantic segmentation on the Graspnet-1B (+10.5% mIoU) and Ocid-grasp (+1.6% AP) datasets, and real-world few-shot grasp synthesis (+21.7% grasp accuracy). The project page is available at: this https URL
- [1674] arXiv:2404.12721 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Generalized Few-Shot Meets Remote Sensing: Discovering Novel Classes in Land Cover Mapping via Hybrid Semantic Segmentation FrameworkComments: 11 pages, 11 figures, accepted by CVPR 2024 L3D-IVU WorkshopSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Land-cover mapping is one of the vital applications in Earth observation, aiming at classifying each pixel's land-cover type of remote-sensing images. As natural and human activities change the landscape, the land-cover map needs to be rapidly updated. However, discovering newly appeared land-cover types in existing classification systems is still a non-trivial task hindered by various scales of complex land objects and insufficient labeled data over a wide-span geographic area. In this paper, we propose a generalized few-shot segmentation-based framework, named SegLand, to update novel classes in high-resolution land-cover mapping. Specifically, the proposed framework is designed in three parts: (a) Data pre-processing: the base training set and the few-shot support sets of novel classes are analyzed and augmented; (b) Hybrid segmentation structure; Multiple base learners and a modified Projection onto Orthogonal Prototypes (POP) network are combined to enhance the base-class recognition and to dig novel classes from insufficient labels data; (c) Ultimate fusion: the semantic segmentation results of the base learners and POP network are reasonably fused. The proposed framework has won first place in the leaderboard of the OpenEarthMap Land Cover Mapping Few-Shot Challenge. Experiments demonstrate the superiority of the framework for automatically updating novel land-cover classes with limited labeled data.
- [1675] arXiv:2404.12744 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Beyond Human Norms: Unveiling Unique Values of Large Language Models through Interdisciplinary ApproachesComments: 16 pages, work in progressSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in Large Language Models (LLMs) have revolutionized the AI field but also pose potential safety and ethical risks. Deciphering LLMs' embedded values becomes crucial for assessing and mitigating their risks. Despite extensive investigation into LLMs' values, previous studies heavily rely on human-oriented value systems in social sciences. Then, a natural question arises: Do LLMs possess unique values beyond those of humans? Delving into it, this work proposes a novel framework, ValueLex, to reconstruct LLMs' unique value system from scratch, leveraging psychological methodologies from human personality/value research. Based on Lexical Hypothesis, ValueLex introduces a generative approach to elicit diverse values from 30+ LLMs, synthesizing a taxonomy that culminates in a comprehensive value framework via factor analysis and semantic clustering. We identify three core value dimensions, Competence, Character, and Integrity, each with specific subdimensions, revealing that LLMs possess a structured, albeit non-human, value system. Based on this system, we further develop tailored projective tests to evaluate and analyze the value inclinations of LLMs across different model sizes, training methods, and data sources. Our framework fosters an interdisciplinary paradigm of understanding LLMs, paving the way for future AI alignment and regulation.
- [1676] arXiv:2404.12753 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: AutoCrawler: A Progressive Understanding Web Agent for Web Crawler GenerationComments: 18 pages, 5 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Web automation is a significant technique that accomplishes complicated web tasks by automating common web actions, enhancing operational efficiency, and reducing the need for manual intervention. Traditional methods, such as wrappers, suffer from limited adaptability and scalability when faced with a new website. On the other hand, generative agents empowered by large language models (LLMs) exhibit poor performance and reusability in open-world scenarios. In this work, we introduce a crawler generation task for vertical information web pages and the paradigm of combining LLMs with crawlers, which helps crawlers handle diverse and changing web environments more efficiently. We propose AutoCrawler, a two-stage framework that leverages the hierarchical structure of HTML for progressive understanding. Through top-down and step-back operations, AutoCrawler can learn from erroneous actions and continuously prune HTML for better action generation. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Resources of this paper can be found at \url{ this https URL }
- [1677] arXiv:2404.12754 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Adaptive Regularization of Representation Rank as an Implicit Constraint of Bellman EquationComments: Accepted to CVPR23; Code: this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Representation rank is an important concept for understanding the role of Neural Networks (NNs) in Deep Reinforcement learning (DRL), which measures the expressive capacity of value networks. Existing studies focus on unboundedly maximizing this rank; nevertheless, that approach would introduce overly complex models in the learning, thus undermining performance. Hence, fine-tuning representation rank presents a challenging and crucial optimization problem. To address this issue, we find a guiding principle for adaptive control of the representation rank. We employ the Bellman equation as a theoretical foundation and derive an upper bound on the cosine similarity of consecutive state-action pairs representations of value networks. We then leverage this upper bound to propose a novel regularizer, namely BEllman Equation-based automatic rank Regularizer (BEER). This regularizer adaptively regularizes the representation rank, thus improving the DRL agent's performance. We first validate the effectiveness of automatic control of rank on illustrative experiments. Then, we scale up BEER to complex continuous control tasks by combining it with the deterministic policy gradient method. Among 12 challenging DeepMind control tasks, BEER outperforms the baselines by a large margin. Besides, BEER demonstrates significant advantages in Q-value approximation. Our code is available at this https URL .
- [1678] arXiv:2404.12768 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MixLight: Borrowing the Best of both Spherical Harmonics and Gaussian ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR)
Abstract: Accurately estimating scene lighting is critical for applications such as mixed reality. Existing works estimate illumination by generating illumination maps or regressing illumination parameters. However, the method of generating illumination maps has poor generalization performance and parametric models such as Spherical Harmonic (SH) and Spherical Gaussian (SG) fall short in capturing high-frequency or low-frequency components. This paper presents MixLight, a joint model that utilizes the complementary characteristics of SH and SG to achieve a more complete illumination representation, which uses SH and SG to capture low-frequency ambient and high-frequency light sources respectively. In addition, a special spherical light source sparsemax (SLSparsemax) module that refers to the position and brightness relationship between spherical light sources is designed to improve their sparsity, which is significant but omitted by prior works. Extensive experiments demonstrate that MixLight surpasses state-of-the-art (SOTA) methods on multiple metrics. In addition, experiments on Web Dataset also show that MixLight as a parametric method has better generalization performance than non-parametric methods.
- [1679] arXiv:2404.12782 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Sentiment-oriented Transformer-based Variational Autoencoder Network for Live Video CommentingComments: 27 pages, 10 figures, ACM Transactions on Multimedia Computing, Communications and Applications, 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Automatic live video commenting is with increasing attention due to its significance in narration generation, topic explanation, etc. However, the diverse sentiment consideration of the generated comments is missing from the current methods. Sentimental factors are critical in interactive commenting, and lack of research so far. Thus, in this paper, we propose a Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) network which consists of a sentiment-oriented diversity encoder module and a batch attention module, to achieve diverse video commenting with multiple sentiments and multiple semantics. Specifically, our sentiment-oriented diversity encoder elegantly combines VAE and random mask mechanism to achieve semantic diversity under sentiment guidance, which is then fused with cross-modal features to generate live video comments. Furthermore, a batch attention module is also proposed in this paper to alleviate the problem of missing sentimental samples, caused by the data imbalance, which is common in live videos as the popularity of videos varies. Extensive experiments on Livebot and VideoIC datasets demonstrate that the proposed So-TVAE outperforms the state-of-the-art methods in terms of the quality and diversity of generated comments. Related code is available at this https URL .
- [1680] arXiv:2404.12792 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Efficient Learning of Fuzzy Logic Systems for Large-Scale Data Using Deep LearningComments: in the International Conference on Intelligent and Fuzzy Systems, 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Type-1 and Interval Type-2 (IT2) Fuzzy Logic Systems (FLS) excel in handling uncertainty alongside their parsimonious rule-based structure. Yet, in learning large-scale data challenges arise, such as the curse of dimensionality and training complexity of FLSs. The complexity is due mainly to the constraints to be satisfied as the learnable parameters define FSs and the complexity of the center of the sets calculation method, especially of IT2-FLSs. This paper explicitly focuses on the learning problem of FLSs and presents a computationally efficient learning method embedded within the realm of Deep Learning (DL). The proposed method tackles the learning challenges of FLSs by presenting computationally efficient implementations of FLSs, thereby minimizing training time while leveraging mini-batched DL optimizers and automatic differentiation provided within the DL frameworks. We illustrate the efficiency of the DL framework for FLSs on benchmark datasets.
- [1681] arXiv:2404.12800 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Zadeh's Type-2 Fuzzy Logic Systems: Precision and High-Quality Prediction IntervalsComments: in the IEEE World Congress on Computational Intelligence, 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: General Type-2 (GT2) Fuzzy Logic Systems (FLSs) are perfect candidates to quantify uncertainty, which is crucial for informed decisions in high-risk tasks, as they are powerful tools in representing uncertainty. In this paper, we travel back in time to provide a new look at GT2-FLSs by adopting Zadeh's (Z) GT2 Fuzzy Set (FS) definition, intending to learn GT2-FLSs that are capable of achieving reliable High-Quality Prediction Intervals (HQ-PI) alongside precision. By integrating Z-GT2-FS with the \(\alpha\)-plane representation, we show that the design flexibility of GT2-FLS is increased as it takes away the dependency of the secondary membership function from the primary membership function. After detailing the construction of Z-GT2-FLSs, we provide solutions to challenges while learning from high-dimensional data: the curse of dimensionality, and integrating Deep Learning (DL) optimizers. We develop a DL framework for learning dual-focused Z-GT2-FLSs with high performances. Our study includes statistical analyses, highlighting that the Z-GT2-FLS not only exhibits high-precision performance but also produces HQ-PIs in comparison to its GT2 and IT2 fuzzy counterparts which have more learnable parameters. The results show that the Z-GT2-FLS has a huge potential in uncertainty quantification.
- [1682] arXiv:2404.12802 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Enhancing Interval Type-2 Fuzzy Logic Systems: Learning for Precision and Prediction IntervalsComments: in the IEEE World Congress on Computational Intelligence, 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we tackle the task of generating Prediction Intervals (PIs) in high-risk scenarios by proposing enhancements for learning Interval Type-2 (IT2) Fuzzy Logic Systems (FLSs) to address their learning challenges. In this context, we first provide extra design flexibility to the Karnik-Mendel (KM) and Nie-Tan (NT) center of sets calculation methods to increase their flexibility for generating PIs. These enhancements increase the flexibility of KM in the defuzzification stage while the NT in the fuzzification stage. To address the large-scale learning challenge, we transform the IT2-FLS's constraint learning problem into an unconstrained form via parameterization tricks, enabling the direct application of deep learning optimizers. To address the curse of dimensionality issue, we expand the High-Dimensional Takagi-Sugeno-Kang (HTSK) method proposed for type-1 FLS to IT2-FLSs, resulting in the HTSK2 approach. Additionally, we introduce a framework to learn the enhanced IT2-FLS with a dual focus, aiming for high precision and PI generation. Through exhaustive statistical results, we reveal that HTSK2 effectively addresses the dimensionality challenge, while the enhanced KM and NT methods improved learning and enhanced uncertainty quantification performances of IT2-FLSs.
- [1683] arXiv:2404.12810 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Enhancing Counterfactual Explanation Search with Diffusion Distance and Directional CoherenceComments: This work has been accepted to be presented to The 2nd World Conference on eXplainable Artificial Intelligence (xAI 2024), July 17-19, 2024 - Valletta, MaltaSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: A pressing issue in the adoption of AI models is the increasing demand for more human-centric explanations of their predictions. To advance towards more human-centric explanations, understanding how humans produce and select explanations has been beneficial. In this work, inspired by insights of human cognition we propose and test the incorporation of two novel biases to enhance the search for effective counterfactual explanations. Central to our methodology is the application of diffusion distance, which emphasizes data connectivity and actionability in the search for feasible counterfactual explanations. In particular, diffusion distance effectively weights more those points that are more interconnected by numerous short-length paths. This approach brings closely connected points nearer to each other, identifying a feasible path between them. We also introduce a directional coherence term that allows the expression of a preference for the alignment between the joint and marginal directional changes in feature space to reach a counterfactual. This term enables the generation of counterfactual explanations that align with a set of marginal predictions based on expectations of how the outcome of the model varies by changing one feature at a time. We evaluate our method, named Coherent Directional Counterfactual Explainer (CoDiCE), and the impact of the two novel biases against existing methods such as DiCE, FACE, Prototypes, and Growing Spheres. Through a series of ablation experiments on both synthetic and real datasets with continuous and mixed-type features, we demonstrate the effectiveness of our method.
- [1684] arXiv:2404.12814 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Generative Modelling with High-Order Langevin DynamicsComments: Some of the results in this paper have been published or accepted at conferences such as wacv2024, icassp2024, and icme2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Diffusion generative modelling (DGM) based on stochastic differential equations (SDEs) with score matching has achieved unprecedented results in data generation. In this paper, we propose a novel fast high-quality generative modelling method based on high-order Langevin dynamics (HOLD) with score matching. This motive is proved by third-order Langevin dynamics. By augmenting the previous SDEs, e.g. variance exploding or variance preserving SDEs for single-data variable processes, HOLD can simultaneously model position, velocity, and acceleration, thereby improving the quality and speed of the data generation at the same time. HOLD is composed of one Ornstein-Uhlenbeck process and two Hamiltonians, which reduce the mixing time by two orders of magnitude. Empirical experiments for unconditional image generation on the public data set CIFAR-10 and CelebA-HQ show that the effect is significant in both Frechet inception distance (FID) and negative log-likelihood, and achieves the state-of-the-art FID of 1.85 on CIFAR-10.
- [1685] arXiv:2404.12832 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical imagesComments: This work has been accepted to be presented to The 2nd World Conference on eXplainable Artificial Intelligence (xAI 2024), July 17-19, 2024 - Valletta, MaltaSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Deep learning is dramatically transforming the field of medical imaging and radiology, enabling the identification of pathologies in medical images, including computed tomography (CT) and X-ray scans. However, the performance of deep learning models, particularly in segmentation tasks, is often limited by the need for extensive annotated datasets. To address this challenge, the capabilities of weakly supervised semantic segmentation are explored through the lens of Explainable AI and the generation of counterfactual explanations. The scope of this research is development of a novel counterfactual inpainting approach (COIN) that flips the predicted classification label from abnormal to normal by using a generative model. For instance, if the classifier deems an input medical image X as abnormal, indicating the presence of a pathology, the generative model aims to inpaint the abnormal region, thus reversing the classifier's original prediction label. The approach enables us to produce precise segmentations for pathologies without depending on pre-existing segmentation masks. Crucially, image-level labels are utilized, which are substantially easier to acquire than creating detailed segmentation masks. The effectiveness of the method is demonstrated by segmenting synthetic targets and actual kidney tumors from CT images acquired from Tartu University Hospital in Estonia. The findings indicate that COIN greatly surpasses established attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an alternative counterfactual explanation method introduced by Singla et al. This evidence suggests that COIN is a promising approach for semantic segmentation of tumors in CT images, and presents a step forward in making deep learning applications more accessible and effective in healthcare, where annotated data is scarce.
- [1686] arXiv:2404.12839 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ECOR: Explainable CLIP for Object RecognitionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large Vision Language Models (VLMs), such as CLIP, have significantly contributed to various computer vision tasks, including object recognition and object detection. Their open vocabulary feature enhances their value. However, their black-box nature and lack of explainability in predictions make them less trustworthy in critical domains. Recently, some work has been done to force VLMs to provide reasonable rationales for object recognition, but this often comes at the expense of classification accuracy. In this paper, we first propose a mathematical definition of explainability in the object recognition task based on the joint probability distribution of categories and rationales, then leverage this definition to fine-tune CLIP in an explainable manner. Through evaluations of different datasets, our method demonstrates state-of-the-art performance in explainable classification. Notably, it excels in zero-shot settings, showcasing its adaptability. This advancement improves explainable object recognition, enhancing trust across diverse applications. The code will be made available online upon publication.
- [1687] arXiv:2404.12856 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Language-Driven Active Learning for Diverse Open-Set 3D Object DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Object detection is crucial for ensuring safe autonomous driving. However, data-driven approaches face challenges when encountering minority or novel objects in the 3D driving scene. In this paper, we propose VisLED, a language-driven active learning framework for diverse open-set 3D Object Detection. Our method leverages active learning techniques to query diverse and informative data samples from an unlabeled pool, enhancing the model's ability to detect underrepresented or novel objects. Specifically, we introduce the Vision-Language Embedding Diversity Querying (VisLED-Querying) algorithm, which operates in both open-world exploring and closed-world mining settings. In open-world exploring, VisLED-Querying selects data points most novel relative to existing data, while in closed-world mining, it mines new instances of known classes. We evaluate our approach on the nuScenes dataset and demonstrate its effectiveness compared to random sampling and entropy-querying methods. Our results show that VisLED-Querying consistently outperforms random sampling and offers competitive performance compared to entropy-querying despite the latter's model-optimality, highlighting the potential of VisLED for improving object detection in autonomous driving scenarios.
- [1688] arXiv:2404.12866 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The increase in parameter size of multimodal large language models (MLLMs) introduces significant capabilities, particularly in-context learning, where MLLMs enhance task performance without updating pre-trained parameters. This effectiveness, however, hinges on the appropriate selection of in-context examples, a process that is currently biased towards visual data, overlooking textual information. Furthermore, the area of supervised retrievers for MLLMs, crucial for optimal in-context example selection, continues to be uninvestigated. Our study offers an in-depth evaluation of the impact of textual information on the unsupervised selection of in-context examples in multimodal contexts, uncovering a notable sensitivity of retriever performance to the employed modalities. Responding to this, we introduce a novel supervised MLLM-retriever MSIER that employs a neural network to select examples that enhance multimodal in-context learning efficiency. This approach is validated through extensive testing across three distinct tasks, demonstrating the method's effectiveness. Additionally, we investigate the influence of modalities on our supervised retrieval method's training and pinpoint factors contributing to our model's success. This exploration paves the way for future advancements, highlighting the potential for refined in-context learning in MLLMs through the strategic use of multimodal data.
- [1689] arXiv:2404.12876 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A Large-scale Medical Visual Task Adaptation BenchmarkSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Visual task adaptation has been demonstrated to be effective in adapting pre-trained Vision Transformers (ViTs) to general downstream visual tasks using specialized learnable layers or tokens. However, there is yet a large-scale benchmark to fully explore the effect of visual task adaptation on the realistic and important medical domain, particularly across diverse medical visual modalities, such as color images, X-ray, and CT. To close this gap, we present Med-VTAB, a large-scale Medical Visual Task Adaptation Benchmark consisting of 1.68 million medical images for diverse organs, modalities, and adaptation approaches. Based on Med-VTAB, we explore the scaling law of medical prompt tuning concerning tunable parameters and the generalizability of medical visual adaptation using non-medical/medical pre-train weights. Besides, we study the impact of patient ID out-of-distribution on medical visual adaptation, which is a real and challenging scenario. Furthermore, results from Med-VTAB indicate that a single pre-trained model falls short in medical task adaptation. Therefore, we introduce GMoE-Adapter, a novel method that combines medical and general pre-training weights through a gated mixture-of-experts adapter, achieving state-of-the-art results in medical visual task adaptation.
- [1690] arXiv:2404.12900 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Training-and-prompt-free General Painterly Harmonization Using Image-wise Attention SharingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: Painterly Image Harmonization aims at seamlessly blending disparate visual elements within a single coherent image. However, previous approaches often encounter significant limitations due to training data constraints, the need for time-consuming fine-tuning, or reliance on additional prompts. To surmount these hurdles, we design a Training-and-prompt-Free General Painterly Harmonization method using image-wise attention sharing (TF-GPH), which integrates a novel "share-attention module". This module redefines the traditional self-attention mechanism by allowing for comprehensive image-wise attention, facilitating the use of a state-of-the-art pretrained latent diffusion model without the typical training data limitations. Additionally, we further introduce "similarity reweighting" mechanism enhances performance by effectively harnessing cross-image information, surpassing the capabilities of fine-tuning or prompt-based approaches. At last, we recognize the deficiencies in existing benchmarks and propose the "General Painterly Harmonization Benchmark", which employs range-based evaluation metrics to more accurately reflect real-world application. Extensive experiments demonstrate the superior efficacy of our method across various benchmarks. The code and web demo are available at this https URL .
- [1691] arXiv:2404.12901 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: Large Language Models for Networking: Workflow, Advances and ChallengesSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI)
Abstract: The networking field is characterized by its high complexity and rapid iteration, requiring extensive expertise to accomplish network tasks, ranging from network design, configuration, diagnosis and security. The inherent complexity of these tasks, coupled with the ever-changing landscape of networking technologies and protocols, poses significant hurdles for traditional machine learning-based methods. These methods often struggle to generalize and automate complex tasks in networking, as they require extensive labeled data, domain-specific feature engineering, and frequent retraining to adapt to new scenarios. However, the recent emergence of large language models (LLMs) has sparked a new wave of possibilities in addressing these challenges. LLMs have demonstrated remarkable capabilities in natural language understanding, generation, and reasoning. These models, trained on extensive data, can benefit the networking domain. Some efforts have already explored the application of LLMs in the networking domain and revealed promising results. By reviewing recent advances, we present an abstract workflow to describe the fundamental process involved in applying LLM for Networking. We introduce the highlights of existing works by category and explain in detail how they operate at different stages of the workflow. Furthermore, we delve into the challenges encountered, discuss potential solutions, and outline future research prospects. We hope that this survey will provide insight for researchers and practitioners, promoting the development of this interdisciplinary research field.
- [1692] arXiv:2404.12917 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Zero-Shot Stitching in Reinforcement Learning using Relative RepresentationsComments: 13 pages, 10 figures, 4 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Visual Reinforcement Learning is a popular and powerful framework that takes full advantage of the Deep Learning breakthrough. However, it is also known that variations in the input (e.g., different colors of the panorama due to the season of the year) or the task (e.g., changing the speed limit for a car to respect) could require complete retraining of the agents. In this work, we leverage recent developments in unifying latent representations to demonstrate that it is possible to combine the components of an agent, rather than retrain it from scratch. We build upon the recent relative representations framework and adapt it for Visual RL. This allows us to create completely new agents capable of handling environment-task combinations never seen during training. Our work paves the road toward a more accessible and flexible use of reinforcement learning.
- [1693] arXiv:2404.12922 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Is Retain Set All You Need in Machine Unlearning? Restoring Performance of Unlearned Models with Out-Of-Distribution ImagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In this paper, we introduce Selective-distillation for Class and Architecture-agnostic unleaRning (SCAR), a novel approximate unlearning method. SCAR efficiently eliminates specific information while preserving the model's test accuracy without using a retain set, which is a key component in state-of-the-art approximate unlearning algorithms. Our approach utilizes a modified Mahalanobis distance to guide the unlearning of the feature vectors of the instances to be forgotten, aligning them to the nearest wrong class distribution. Moreover, we propose a distillation-trick mechanism that distills the knowledge of the original model into the unlearning model with out-of-distribution images for retaining the original model's test performance without using any retain set. Importantly, we propose a self-forget version of SCAR that unlearns without having access to the forget set. We experimentally verified the effectiveness of our method, on three public datasets, comparing it with state-of-the-art methods. Our method obtains performance higher than methods that operate without the retain set and comparable w.r.t the best methods that rely on the retain set.
- [1694] arXiv:2404.12928 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: The Positivity of the Neural Tangent KernelComments: Comments welcomeSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Probability (math.PR); Spectral Theory (math.SP)
Abstract: The Neural Tangent Kernel (NTK) has emerged as a fundamental concept in the study of wide Neural Networks. In particular, it is known that the positivity of the NTK is directly related to the memorization capacity of sufficiently wide networks, i.e., to the possibility of reaching zero loss in training, via gradient descent. Here we will improve on previous works and obtain a sharp result concerning the positivity of the NTK of feedforward networks of any depth. More precisely, we will show that, for any non-polynomial activation function, the NTK is strictly positive definite. Our results are based on a novel characterization of polynomial functions which is of independent interest.
- [1695] arXiv:2404.12933 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Cross-cultural Inspiration Detection and Analysis in Real and LLM-generated Social Media DataSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Inspiration is linked to various positive outcomes, such as increased creativity, productivity, and happiness. Although inspiration has great potential, there has been limited effort toward identifying content that is inspiring, as opposed to just engaging or positive. Additionally, most research has concentrated on Western data, with little attention paid to other cultures. This work is the first to study cross-cultural inspiration through machine learning methods. We aim to identify and analyze real and AI-generated cross-cultural inspiring posts. To this end, we compile and make publicly available the InspAIred dataset, which consists of 2,000 real inspiring posts, 2,000 real non-inspiring posts, and 2,000 generated inspiring posts evenly distributed across India and the UK. The real posts are sourced from Reddit, while the generated posts are created using the GPT-4 model. Using this dataset, we conduct extensive computational linguistic analyses to (1) compare inspiring content across cultures, (2) compare AI-generated inspiring posts to real inspiring posts, and (3) determine if detection models can accurately distinguish between inspiring content across cultures and data sources.
- [1696] arXiv:2404.12938 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MAiDE-up: Multilingual Deception Detection of GPT-generated Hotel ReviewsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Deceptive reviews are becoming increasingly common, especially given the increase in performance and the prevalence of LLMs. While work to date has addressed the development of models to differentiate between truthful and deceptive human reviews, much less is known about the distinction between real reviews and AI-authored fake reviews. Moreover, most of the research so far has focused primarily on English, with very little work dedicated to other languages. In this paper, we compile and make publicly available the MAiDE-up dataset, consisting of 10,000 real and 10,000 AI-generated fake hotel reviews, balanced across ten languages. Using this dataset, we conduct extensive linguistic analyses to (1) compare the AI fake hotel reviews to real hotel reviews, and (2) identify the factors that influence the deception detection model performance. We explore the effectiveness of several models for deception detection in hotel reviews across three main dimensions: sentiment, location, and language. We find that these dimensions influence how well we can detect AI-generated fake reviews.
- [1697] arXiv:2404.12966 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Eyes Can Deceive: Benchmarking Counterfactual Reasoning Abilities of Multi-modal Large Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Counterfactual reasoning, as a crucial manifestation of human intelligence, refers to making presuppositions based on established facts and extrapolating potential outcomes. Existing multimodal large language models (MLLMs) have exhibited impressive cognitive and reasoning capabilities, which have been examined across a wide range of Visual Question Answering (VQA) benchmarks. Nevertheless, how will existing MLLMs perform when faced with counterfactual questions? To answer this question, we first curate a novel \textbf{C}ounter\textbf{F}actual \textbf{M}ulti\textbf{M}odal reasoning benchmark, abbreviated as \textbf{CFMM}, to systematically assess the counterfactual reasoning capabilities of MLLMs. Our CFMM comprises six challenging tasks, each including hundreds of carefully human-labeled counterfactual questions, to evaluate MLLM's counterfactual reasoning capabilities across diverse aspects. Through experiments, interestingly, we find that existing MLLMs prefer to believe what they see, but ignore the counterfactual presuppositions presented in the question, thereby leading to inaccurate responses. Furthermore, we evaluate a wide range of prevalent MLLMs on our proposed CFMM. The significant gap between their performance on our CFMM and that on several VQA benchmarks indicates that there is still considerable room for improvement in existing MLLMs toward approaching human-level intelligence. On the other hand, through boosting MLLMs performances on our CFMM in the future, potential avenues toward developing MLLMs with advanced intelligence can be explored.
- [1698] arXiv:2404.12969 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Disentangling ID and Modality Effects for Session-based RecommendationComments: This work has been accepted by SIGIR24' as a full paperSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Session-based recommendation aims to predict intents of anonymous users based on their limited behaviors. Modeling user behaviors involves two distinct rationales: co-occurrence patterns reflected by item IDs, and fine-grained preferences represented by item modalities (e.g., text and images). However, existing methods typically entangle these causes, leading to their failure in achieving accurate and explainable recommendations. To this end, we propose a novel framework DIMO to disentangle the effects of ID and modality in the task. At the item level, we introduce a co-occurrence representation schema to explicitly incorporate cooccurrence patterns into ID representations. Simultaneously, DIMO aligns different modalities into a unified semantic space to represent them uniformly. At the session level, we present a multi-view self-supervised disentanglement, including proxy mechanism and counterfactual inference, to disentangle ID and modality effects without supervised signals. Leveraging these disentangled causes, DIMO provides recommendations via causal inference and further creates two templates for generating explanations. Extensive experiments on multiple real-world datasets demonstrate the consistent superiority of DIMO over existing methods. Further analysis also confirms DIMO's effectiveness in generating explanations.
- [1699] arXiv:2404.12975 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: FineRec:Exploring Fine-grained Sequential RecommendationComments: This work has been accepted by SIGIR24' as a full paperSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Sequential recommendation is dedicated to offering items of interest for users based on their history behaviors. The attribute-opinion pairs, expressed by users in their reviews for items, provide the potentials to capture user preferences and item characteristics at a fine-grained level. To this end, we propose a novel framework FineRec that explores the attribute-opinion pairs of reviews to finely handle sequential recommendation. Specifically, we utilize a large language model to extract attribute-opinion pairs from reviews. For each attribute, a unique attribute-specific user-opinion-item graph is created, where corresponding opinions serve as the edges linking heterogeneous user and item nodes. To tackle the diversity of opinions, we devise a diversity-aware convolution operation to aggregate information within the graphs, enabling attribute-specific user and item representation learning. Ultimately, we present an interaction-driven fusion mechanism to integrate attribute-specific user/item representations across all attributes for generating recommendations. Extensive experiments conducted on several realworld datasets demonstrate the superiority of our FineRec over existing state-of-the-art methods. Further analysis also verifies the effectiveness of our fine-grained manner in handling the task.
- [1700] arXiv:2404.12984 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Eye-tracking in Mixed Reality for Diagnosis of Neurodegenerative DiseasesMateusz Daniol , Daria Hemmerling , Jakub Sikora , Pawel Jemiolo , Marek Wodzinski , Magdalena Wojcik-PedziwiatrSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Parkinson's disease ranks as the second most prevalent neurodegenerative disorder globally. This research aims to develop a system leveraging Mixed Reality capabilities for tracking and assessing eye movements. In this paper, we present a medical scenario and outline the development of an application designed to capture eye-tracking signals through Mixed Reality technology for the evaluation of neurodegenerative diseases. Additionally, we introduce a pipeline for extracting clinically relevant features from eye-gaze analysis, describing the capabilities of the proposed system from a medical perspective. The study involved a cohort of healthy control individuals and patients suffering from Parkinson's disease, showcasing the feasibility and potential of the proposed technology for non-intrusive monitoring of eye movement patterns for the diagnosis of neurodegenerative diseases.
Clinical relevance - Developing a non-invasive biomarker for Parkinson's disease is urgently needed to accurately detect the disease's onset. This would allow for the timely introduction of neuroprotective treatment at the earliest stage and enable the continuous monitoring of intervention outcomes. The ability to detect subtle changes in eye movements allows for early diagnosis, offering a critical window for intervention before more pronounced symptoms emerge. Eye tracking provides objective and quantifiable biomarkers, ensuring reliable assessments of disease progression and cognitive function. The eye gaze analysis using Mixed Reality glasses is wireless, facilitating convenient assessments in both home and hospital settings. The approach offers the advantage of utilizing hardware that requires no additional specialized attachments, enabling examinations through personal eyewear. - [1701] arXiv:2404.12999 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Goal Exploration via Adaptive Skill Distribution for Goal-Conditioned Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Exploration efficiency poses a significant challenge in goal-conditioned reinforcement learning (GCRL) tasks, particularly those with long horizons and sparse rewards. A primary limitation to exploration efficiency is the agent's inability to leverage environmental structural patterns. In this study, we introduce a novel framework, GEASD, designed to capture these patterns through an adaptive skill distribution during the learning process. This distribution optimizes the local entropy of achieved goals within a contextual horizon, enhancing goal-spreading behaviors and facilitating deep exploration in states containing familiar structural patterns. Our experiments reveal marked improvements in exploration efficiency using the adaptive skill distribution compared to a uniform skill distribution. Additionally, the learned skill distribution demonstrates robust generalization capabilities, achieving substantial exploration progress in unseen tasks containing similar local structures.
- [1702] arXiv:2404.13004 (cross-list from cs.CE) [ pdf , ps , html , other ]
-
Title: FinLangNet: A Novel Deep Learning Framework for Credit Risk Prediction Using Linguistic Analogy in Financial DataSubjects: Computational Engineering, Finance, and Science (cs.CE) ; Artificial Intelligence (cs.AI)
Abstract: Recent industrial applications in risk prediction still heavily rely on extensively manually-tuned, statistical learning methods. Real-world financial data, characterized by its high-dimensionality, sparsity, high noise levels, and significant imbalance, poses unique challenges for the effective application of deep neural network models. In this work, we introduce a novel deep learning risk prediction framework, FinLangNet, which conceptualizes credit loan trajectories in a structure that mirrors linguistic constructs. This framework is tailored for credit risk prediction using real-world financial data, drawing on structural similarities to language by adapting natural language processing techniques. It focuses on analyzing the evolution and predictability of credit histories through detailed financial event sequences. Our research demonstrates that FinLangNet surpasses traditional statistical methods in predicting credit risk and that its integration with these methods enhances credit card fraud prediction models, achieving a significant improvement of over 1.5 points in the Kolmogorov-Smirnov metric.
- [1703] arXiv:2404.13013 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Groma: Localized Visual Tokenization for Grounding Multimodal Large Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capabilities are built upon a localized visual tokenization mechanism, where an image input is decomposed into regions of interest and subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs and ground its textual output to images. Besides, to enhance the grounded chat ability of Groma, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or external module for localization, Groma consistently demonstrates superior performances in standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization. Project page: this https URL .
- [1704] arXiv:2404.13026 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: PhysDreamer: Physics-Based Interaction with 3D Objects via Video GenerationTianyuan Zhang , Hong-Xing Yu , Rundi Wu , Brandon Y. Feng , Changxi Zheng , Noah Snavely , Jiajun Wu , William T. FreemanComments: Project website at: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Realistic object interactions are crucial for creating immersive virtual experiences, yet synthesizing realistic 3D object dynamics in response to novel interactions remains a significant challenge. Unlike unconditional or text-conditioned dynamics generation, action-conditioned dynamics requires perceiving the physical material properties of objects and grounding the 3D motion prediction on these properties, such as object stiffness. However, estimating physical material properties is an open problem due to the lack of material ground-truth data, as measuring these properties for real objects is highly difficult. We present PhysDreamer, a physics-based approach that endows static 3D objects with interactive dynamics by leveraging the object dynamics priors learned by video generation models. By distilling these priors, PhysDreamer enables the synthesis of realistic object responses to novel interactions, such as external forces or agent manipulations. We demonstrate our approach on diverse examples of elastic objects and evaluate the realism of the synthesized interactions through a user study. PhysDreamer takes a step towards more engaging and realistic virtual experiences by enabling static 3D objects to dynamically respond to interactive stimuli in a physically plausible manner. See our project page at this https URL .
- [1705] arXiv:2404.13028 (cross-list from cs.CE) [ pdf , ps , other ]
-
Title: When Life gives you LLMs, make LLM-ADE: Large Language Models with Adaptive Data EngineeringComments: 6 pages, 3 tables and 3 figuresSubjects: Computational Engineering, Finance, and Science (cs.CE) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents the LLM-ADE framework, a novel methodology for continued pre-training of large language models (LLMs) that addresses the challenges of catastrophic forgetting and double descent. LLM-ADE employs dynamic architectural adjustments, including selective block freezing and expansion, tailored to specific datasets. This strategy enhances model adaptability to new data while preserving previously acquired knowledge. We demonstrate LLM-ADE's effectiveness on the TinyLlama model across various general knowledge benchmarks, showing significant performance improvements without the drawbacks of traditional continuous training methods. This approach promises a more versatile and robust way to keep LLMs current and efficient in real-world applications.
- [1706] arXiv:2404.13050 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: FlowMind: Automatic Workflow Generation with LLMsZhen Zeng , William Watson , Nicole Cho , Saba Rahimi , Shayleen Reynolds , Tucker Balch , Manuela VelosoComments: Published in ACM ICAIF 2023Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The rapidly evolving field of Robotic Process Automation (RPA) has made significant strides in automating repetitive processes, yet its effectiveness diminishes in scenarios requiring spontaneous or unpredictable tasks demanded by users. This paper introduces a novel approach, FlowMind, leveraging the capabilities of Large Language Models (LLMs) such as Generative Pretrained Transformer (GPT), to address this limitation and create an automatic workflow generation system. In FlowMind, we propose a generic prompt recipe for a lecture that helps ground LLM reasoning with reliable Application Programming Interfaces (APIs). With this, FlowMind not only mitigates the common issue of hallucinations in LLMs, but also eliminates direct interaction between LLMs and proprietary data or code, thus ensuring the integrity and confidentiality of information - a cornerstone in financial services. FlowMind further simplifies user interaction by presenting high-level descriptions of auto-generated workflows, enabling users to inspect and provide feedback effectively. We also introduce NCEN-QA, a new dataset in finance for benchmarking question-answering tasks from N-CEN reports on funds. We used NCEN-QA to evaluate the performance of workflows generated by FlowMind against baseline and ablation variants of FlowMind. We demonstrate the success of FlowMind, the importance of each component in the proposed lecture recipe, and the effectiveness of user interaction and feedback in FlowMind.
- [1707] arXiv:2404.13060 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: The Necessity of AI Audit Standards BoardsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Auditing of AI systems is a promising way to understand and manage ethical problems and societal risks associated with contemporary AI systems, as well as some anticipated future risks. Efforts to develop standards for auditing Artificial Intelligence (AI) systems have therefore understandably gained momentum. However, we argue that creating auditing standards is not just insufficient, but actively harmful by proliferating unheeded and inconsistent standards, especially in light of the rapid evolution and ethical and safety challenges of AI. Instead, the paper proposes the establishment of an AI Audit Standards Board, responsible for developing and updating auditing methods and standards in line with the evolving nature of AI technologies. Such a body would ensure that auditing practices remain relevant, robust, and responsive to the rapid advancements in AI. The paper argues that such a governance structure would also be helpful for maintaining public trust in AI and for promoting a culture of safety and ethical responsibility within the AI industry.
Throughout the paper, we draw parallels with other industries, including safety-critical industries like aviation and nuclear energy, as well as more prosaic ones such as financial accounting and pharmaceuticals. AI auditing should emulate those fields, and extend beyond technical assessments to include ethical considerations and stakeholder engagement, but we explain that this is not enough; emulating other fields' governance mechanisms for these processes, and for audit standards creation, is a necessity. We also emphasize the importance of auditing the entire development process of AI systems, not just the final products... - [1708] arXiv:2404.13061 (cross-list from cs.AR) [ pdf , ps , html , other ]
-
Title: FPGA Divide-and-Conquer Placement using Deep Reinforcement LearningComments: accepted by ISEDA2024Subjects: Hardware Architecture (cs.AR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper introduces the problem of learning to place logic blocks in Field-Programmable Gate Arrays (FPGAs) and a learning-based method. In contrast to previous search-based placement algorithms, we instead employ Reinforcement Learning (RL) with the goal of minimizing wirelength. In addition to our preliminary learning results, we also evaluated a novel decomposition to address the nature of large search space when placing many blocks on a chipboard. Empirical experiments evaluate the effectiveness of the learning and decomposition paradigms on FPGA placement tasks.
- [1709] arXiv:2404.13065 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Intellecta Cognitiva: A Comprehensive Dataset for Advancing Academic Knowledge and Machine ReasoningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Intellecta dataset emerges as an innovative synthetic dataset, engineered to enhance the cognitive processing capabilities of contemporary language models. With a composition of 11.53 billion tokens, integrating 8.01 billion tokens of synthetic data with 3.52 billion tokens of rich textbook data, Intellecta is crafted to foster advanced reasoning and comprehensive educational narrative generation. Leveraging the Mixtral-8x7B-Instruct-v0.1 model, the dataset facilitates the generation of complex thought processes and detailed, textbook-style explanations, thus enabling language models to engage in both critical thinking and profound educational discourse. This hybrid dataset stands as a testament to the potential of synthetic data in pushing the boundaries of AI, offering a repository that is not only vast and varied but also refined to align with ethical standards and intellectual rigor.
- [1710] arXiv:2404.13066 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Leveraging Large Language Model as Simulated Patients for Clinical EducationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Simulated Patients (SPs) play a crucial role in clinical medical education by providing realistic scenarios for student practice. However, the high cost of training and hiring qualified SPs, along with the heavy workload and potential risks they face in consistently portraying actual patients, limit students' access to this type of clinical training. Consequently, the integration of computer program-based simulated patients has emerged as a valuable educational tool in recent years. With the rapid development of Large Language Models (LLMs), their exceptional capabilities in conversational artificial intelligence and role-playing have been demonstrated, making them a feasible option for implementing Virtual Simulated Patient (VSP). In this paper, we present an integrated model-agnostic framework called CureFun that harnesses the potential of LLMs in clinical medical education. This framework facilitates natural conversations between students and simulated patients, evaluates their dialogue, and provides suggestions to enhance students' clinical inquiry skills. Through comprehensive evaluations, our approach demonstrates more authentic and professional SP-scenario dialogue flows compared to other LLM-based chatbots, thus proving its proficiency in simulating patients. Additionally, leveraging CureFun's evaluation ability, we assess several medical LLMs and discuss the possibilities and limitations of using LLMs as virtual doctors from the perspective of their diagnostic abilities.
- [1711] arXiv:2404.13067 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Efficient Resume Understanding: A Multi-Granularity Multi-Modal Pre-Training ApproachFeihu Jiang , Chuan Qin , Jingshuai Zhang , Kaichun Yao , Xi Chen , Dazhong Shen , Chen Zhu , Hengshu Zhu , Hui XiongComments: ICME 2024 AcceptedSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In the contemporary era of widespread online recruitment, resume understanding has been widely acknowledged as a fundamental and crucial task, which aims to extract structured information from resume documents automatically. Compared to the traditional rule-based approaches, the utilization of recently proposed pre-trained document understanding models can greatly enhance the effectiveness of resume understanding. The present approaches have, however, disregarded the hierarchical relations within the structured information presented in resumes, and have difficulty parsing resumes in an efficient manner. To this end, in this paper, we propose a novel model, namely ERU, to achieve efficient resume understanding. Specifically, we first introduce a layout-aware multi-modal fusion transformer for encoding the segments in the resume with integrated textual, visual, and layout information. Then, we design three self-supervised tasks to pre-train this module via a large number of unlabeled resumes. Next, we fine-tune the model with a multi-granularity sequence labeling task to extract structured information from resumes. Finally, extensive experiments on a real-world dataset clearly demonstrate the effectiveness of ERU.
- [1712] arXiv:2404.13070 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Evidence from counterfactual tasks supports emergent analogical reasoning in large language modelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We recently reported evidence that large language models are capable of solving a wide range of text-based analogy problems in a zero-shot manner, indicating the presence of an emergent capacity for analogical reasoning. Two recent commentaries have challenged these results, citing evidence from so-called `counterfactual' tasks in which the standard sequence of the alphabet is arbitrarily permuted so as to decrease similarity with materials that may have been present in the language model's training data. Here, we reply to these critiques, clarifying some misunderstandings about the test materials used in our original work, and presenting evidence that language models are also capable of generalizing to these new counterfactual task variants.
- [1713] arXiv:2404.13071 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Modeling Emotions and Ethics with Large Language ModelsComments: 9 pages, 4 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper explores the integration of human-like emotions and ethical considerations into Large Language Models (LLMs). We first model eight fundamental human emotions, presented as opposing pairs, and employ collaborative LLMs to reinterpret and express these emotions across a spectrum of intensity. Our focus extends to embedding a latent ethical dimension within LLMs, guided by a novel self-supervised learning algorithm with human feedback (SSHF). This approach enables LLMs to perform self-evaluations and adjustments concerning ethical guidelines, enhancing their capability to generate content that is not only emotionally resonant but also ethically aligned. The methodologies and case studies presented herein illustrate the potential of LLMs to transcend mere text and image generation, venturing into the realms of empathetic interaction and principled decision-making, thereby setting a new precedent in the development of emotionally aware and ethically conscious AI systems.
- [1714] arXiv:2404.13074 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Compositionally Generalizable Semantic Parsing in Large Language Models: A SurveySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Compositional generalization is the ability of a model to generalize to complex, previously unseen types of combinations of entities from just having seen the primitives. This type of generalization is particularly relevant to the semantic parsing community for applications such as task-oriented dialogue, text-to-SQL parsing, and information retrieval, as they can harbor infinite complexity. Despite the success of large language models (LLMs) in a wide range of NLP tasks, unlocking perfect compositional generalization still remains one of the few last unsolved frontiers. The past few years has seen a surge of interest in works that explore the limitations of, methods to improve, and evaluation metrics for compositional generalization capabilities of LLMs for semantic parsing tasks. In this work, we present a literature survey geared at synthesizing recent advances in analysis, methods, and evaluation schemes to offer a starting point for both practitioners and researchers in this area.
- [1715] arXiv:2404.13076 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: LLM Evaluators Recognize and Favor Their Own GenerationsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Self-evaluation using large language models (LLMs) has proven valuable not only in benchmarking but also methods like reward modeling, constitutional AI, and self-refinement. But new biases are introduced due to the same LLM acting as both the evaluator and the evaluatee. One such bias is self-preference, where an LLM evaluator scores its own outputs higher than others' while human annotators consider them of equal quality. But do LLMs actually recognize their own outputs when they give those texts higher scores, or is it just a coincidence? In this paper, we investigate if self-recognition capability contributes to self-preference. We discover that, out of the box, LLMs such as GPT-4 and Llama 2 have non-trivial accuracy at distinguishing themselves from other LLMs and humans. By fine-tuning LLMs, we discover a linear correlation between self-recognition capability and the strength of self-preference bias; using controlled experiments, we show that the causal explanation resists straightforward confounders. We discuss how self-recognition can interfere with unbiased evaluations and AI safety more generally.
- [1716] arXiv:2404.13081 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMsJaehyung Kim , Jaehyun Nam , Sangwoo Mo , Jongjin Park , Sang-Woo Lee , Minjoon Seo , Jung-Woo Ha , Jinwoo ShinComments: Accepted at ICLR 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) have made significant advancements in various natural language processing tasks, including question answering (QA) tasks. While incorporating new information with the retrieval of relevant passages is a promising way to improve QA with LLMs, the existing methods often require additional fine-tuning which becomes infeasible with recent LLMs. Augmenting retrieved passages via prompting has the potential to address this limitation, but this direction has been limitedly explored. To this end, we design a simple yet effective framework to enhance open-domain QA (ODQA) with LLMs, based on the summarized retrieval (SuRe). SuRe helps LLMs predict more accurate answers for a given question, which are well-supported by the summarized retrieval that could be viewed as an explicit rationale extracted from the retrieved passages. Specifically, SuRe first constructs summaries of the retrieved passages for each of the multiple answer candidates. Then, SuRe confirms the most plausible answer from the candidate set by evaluating the validity and ranking of the generated summaries. Experimental results on diverse ODQA benchmarks demonstrate the superiority of SuRe, with improvements of up to 4.6% in exact match (EM) and 4.0% in F1 score over standard prompting approaches. SuRe also can be integrated with a broad range of retrieval methods and LLMs. Finally, the generated summaries from SuRe show additional advantages to measure the importance of retrieved passages and serve as more preferred rationales by models and humans.
- [1717] arXiv:2404.13082 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: TREACLE: Thrifty Reasoning via Context-Aware LLM and Prompt SelectionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent successes in natural language processing have led to the proliferation of large language models (LLMs) by multiple providers. Each LLM offering has different inference accuracy, monetary cost, and latency, and their accuracy further depends on the exact wording of the question (i.e., the specific prompt). At the same time, users often have a limit on monetary budget and latency to answer all their questions, and they do not know which LLMs to choose for each question to meet their accuracy and long-term budget requirements. To navigate this rich design space, we propose TREACLE (Thrifty Reasoning via Context-Aware LLM and Prompt Selection), a reinforcement learning policy that jointly selects the model and prompting scheme while respecting the user's monetary cost and latency constraints. TREACLE uses the problem context, including question text embeddings (reflecting the type or difficulty of a query) and the response history (reflecting the consistency of previous responses) to make smart decisions. Our evaluations on standard reasoning datasets (GSM8K, CSQA, and LLC ) with various LLMs and prompts show that TREACLE enables cost savings of up to 85% compared to baselines while maintaining high accuracy. Importantly, it provides the user with the ability to gracefully trade off accuracy for cost.
- [1718] arXiv:2404.13096 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: Reducing Redundant Computation in Multi-Agent Coordination through Locally Centralized ExecutionComments: 6 pages, 5 figures, Under reviewSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In multi-agent reinforcement learning, decentralized execution is a common approach, yet it suffers from the redundant computation problem. This occurs when multiple agents redundantly perform the same or similar computation due to overlapping observations. To address this issue, this study introduces a novel method referred to as locally centralized team transformer (LCTT). LCTT establishes a locally centralized execution framework where selected agents serve as leaders, issuing instructions, while the rest agents, designated as workers, act as these instructions without activating their policy networks. For LCTT, we proposed the team-transformer (T-Trans) architecture that allows leaders to provide specific instructions to each worker, and the leadership shift mechanism that allows agents autonomously decide their roles as leaders or workers. Our experimental results demonstrate that the proposed method effectively reduces redundant computation, does not decrease reward levels, and leads to faster learning convergence.
- [1719] arXiv:2404.13099 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Mathify: Evaluating Large Language Models on Mathematical Problem Solving TasksAvinash Anand , Mohit Gupta , Kritarth Prasad , Navya Singla , Sanjana Sanjeev , Jatin Kumar , Adarsh Raj Shivam , Rajiv Ratn ShahComments: 10 pages, 3 figures, NeurIPS 2023 Workshop on Generative AI for Education (GAIED)Journal-ref: NeurIPS 2023 Workshop on Generative AI for Education (GAIED)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The rapid progress in the field of natural language processing (NLP) systems and the expansion of large language models (LLMs) have opened up numerous opportunities in the field of education and instructional methods. These advancements offer the potential for tailored learning experiences and immediate feedback, all delivered through accessible and cost-effective services. One notable application area for this technological advancement is in the realm of solving mathematical problems. Mathematical problem-solving not only requires the ability to decipher complex problem statements but also the skill to perform precise arithmetic calculations at each step of the problem-solving process. However, the evaluation of the arithmetic capabilities of large language models remains an area that has received relatively little attention. In response, we introduce an extensive mathematics dataset called "MathQuest" sourced from the 11th and 12th standard Mathematics NCERT textbooks. This dataset encompasses mathematical challenges of varying complexity and covers a wide range of mathematical concepts. Utilizing this dataset, we conduct fine-tuning experiments with three prominent LLMs: LLaMA-2, WizardMath, and MAmmoTH. These fine-tuned models serve as benchmarks for evaluating their performance on our dataset. Our experiments reveal that among the three models, MAmmoTH-13B emerges as the most proficient, achieving the highest level of competence in solving the presented mathematical problems. Consequently, MAmmoTH-13B establishes itself as a robust and dependable benchmark for addressing NCERT mathematics problems.
- [1720] arXiv:2404.13101 (cross-list from eess.IV) [ pdf , ps , other ]
-
Title: DensePANet: An improved generative adversarial network for photoacoustic tomography image reconstruction from sparse dataSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Abstract: Image reconstruction is an essential step of every medical imaging method, including Photoacoustic Tomography (PAT), which is a promising modality of imaging, that unites the benefits of both ultrasound and optical imaging methods. Reconstruction of PAT images using conventional methods results in rough artifacts, especially when applied directly to sparse PAT data. In recent years, generative adversarial networks (GANs) have shown a powerful performance in image generation as well as translation, rendering them a smart choice to be applied to reconstruction tasks. In this study, we proposed an end-to-end method called DensePANet to solve the problem of PAT image reconstruction from sparse data. The proposed model employs a novel modification of UNet in its generator, called FD-UNet++, which considerably improves the reconstruction performance. We evaluated the method on various in-vivo and simulated datasets. Quantitative and qualitative results show the better performance of our model over other prevalent deep learning techniques.
- [1721] arXiv:2404.13104 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Multi Class Depression Detection Through Tweets using Artificial IntelligenceComments: 33 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Depression is a significant issue nowadays. As per the World Health Organization (WHO), in 2023, over 280 million individuals are grappling with depression. This is a huge number; if not taken seriously, these numbers will increase rapidly. About 4.89 billion individuals are social media users. People express their feelings and emotions on platforms like Twitter, Facebook, Reddit, Instagram, etc. These platforms contain valuable information which can be used for research purposes. Considerable research has been conducted across various social media platforms. However, certain limitations persist in these endeavors. Particularly, previous studies were only focused on detecting depression and the intensity of depression in tweets. Also, there existed inaccuracies in dataset labeling. In this research work, five types of depression (Bipolar, major, psychotic, atypical, and postpartum) were predicted using tweets from the Twitter database based on lexicon labeling. Explainable AI was used to provide reasoning by highlighting the parts of tweets that represent type of depression. Bidirectional Encoder Representations from Transformers (BERT) was used for feature extraction and training. Machine learning and deep learning methodologies were used to train the model. The BERT model presented the most promising results, achieving an overall accuracy of 0.96.
- [1722] arXiv:2404.13131 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: From Model Performance to Claim: How a Change of Focus in Machine Learning Replicability Can Help Bridge the Responsibility GapComments: Forthcoming in FAccT 2024Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Two goals - improving replicability and accountability of Machine Learning research respectively, have accrued much attention from the AI ethics and the Machine Learning community. Despite sharing the measures of improving transparency, the two goals are discussed in different registers - replicability registers with scientific reasoning whereas accountability registers with ethical reasoning. Given the existing challenge of the Responsibility Gap - holding Machine Learning scientists accountable for Machine Learning harms due to them being far from sites of application, this paper posits that reconceptualizing replicability can help bridge the gap. Through a shift from model performance replicability to claim replicability, Machine Learning scientists can be held accountable for producing non-replicable claims that are prone to eliciting harm due to misuse and misinterpretation. In this paper, I make the following contributions. First, I define and distinguish two forms of replicability for ML research that can aid constructive conversations around replicability. Second, I formulate an argument for claim-replicability's advantage over model performance replicability in justifying assigning accountability to Machine Learning scientists for producing non-replicable claims and show how it enacts a sense of responsibility that is actionable. In addition, I characterize the implementation of claim replicability as more of a social project than a technical one by discussing its competing epistemological principles, practical implications on Circulating Reference, Interpretative Labor, and research communication.
- [1723] arXiv:2404.13139 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Explainable AI for Fair Sepsis Mortality Predictive ModelComments: Accepted to the 22nd International Conference on Artificial Intelligence in Medicine (AIME'24)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Artificial intelligence supports healthcare professionals with predictive modeling, greatly transforming clinical decision-making. This study addresses the crucial need for fairness and explainability in AI applications within healthcare to ensure equitable outcomes across diverse patient demographics. By focusing on the predictive modeling of sepsis-related mortality, we propose a method that learns a performance-optimized predictive model and then employs the transfer learning process to produce a model with better fairness. Our method also introduces a novel permutation-based feature importance algorithm aiming at elucidating the contribution of each feature in enhancing fairness on predictions. Unlike existing explainability methods concentrating on explaining feature contribution to predictive performance, our proposed method uniquely bridges the gap in understanding how each feature contributes to fairness. This advancement is pivotal, given sepsis's significant mortality rate and its role in one-third of hospital deaths. Our method not only aids in identifying and mitigating biases within the predictive model but also fosters trust among healthcare stakeholders by improving the transparency and fairness of model predictions, thereby contributing to more equitable and trustworthy healthcare delivery.
- [1724] arXiv:2404.13142 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Decentralized Coordination of Distributed Energy Resources through Local Energy Markets and Deep Reinforcement LearningComments: preprint, submitted to Energy and AISubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: As the energy landscape evolves toward sustainability, the accelerating integration of distributed energy resources poses challenges to the operability and reliability of the electricity grid. One significant aspect of this issue is the notable increase in net load variability at the grid edge. Transactive energy, implemented through local energy markets, has recently garnered attention as a promising solution to address the grid challenges in the form of decentralized, indirect demand response on a community level. Given the nature of these challenges, model-free control approaches, such as deep reinforcement learning, show promise for the decentralized automation of participation within this context. Existing studies at the intersection of transactive energy and model-free control primarily focus on socioeconomic and self-consumption metrics, overlooking the crucial goal of reducing community-level net load variability. This study addresses this gap by training a set of deep reinforcement learning agents to automate end-user participation in ALEX, an economy-driven local energy market. In this setting, agents do not share information and only prioritize individual bill optimization. The study unveils a clear correlation between bill reduction and reduced net load variability in this setup. The impact on net load variability is assessed over various time horizons using metrics such as ramping rate, daily and monthly load factor, as well as daily average and total peak export and import on an open-source dataset. Agents are then benchmarked against several baselines, with their performance levels showing promising results, approaching those of a near-optimal dynamic programming benchmark.
- [1725] arXiv:2404.13149 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Beyond Self-Consistency: Ensemble Reasoning Boosts Consistency and Accuracy of LLMs in Cancer StagingComments: accepted to the 22nd International Conference on Artificial Intelligence in Medicine (AIME'24)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Advances in large language models (LLMs) have encouraged their adoption in the healthcare domain where vital clinical information is often contained in unstructured notes. Cancer staging status is available in clinical reports, but it requires natural language processing to extract the status from the unstructured text. With the advance in clinical-oriented LLMs, it is promising to extract such status without extensive efforts in training the algorithms. Prompting approaches of the pre-trained LLMs that elicit a model's reasoning process, such as chain-of-thought, may help to improve the trustworthiness of the generated responses. Using self-consistency further improves model performance, but often results in inconsistent generations across the multiple reasoning paths. In this study, we propose an ensemble reasoning approach with the aim of improving the consistency of the model generations. Using an open access clinical large language model to determine the pathologic cancer stage from real-world pathology reports, we show that the ensemble reasoning approach is able to improve both the consistency and performance of the LLM in determining cancer stage, thereby demonstrating the potential to use these models in clinical or other domains where reliability and trustworthiness are critical.
- [1726] arXiv:2404.13192 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Heterogeneous Subgraph Transformer for Fake News DetectionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Fake news is pervasive on social media, inflicting substantial harm on public discourse and societal well-being. We investigate the explicit structural information and textual features of news pieces by constructing a heterogeneous graph concerning the relations among news topics, entities, and content. Through our study, we reveal that fake news can be effectively detected in terms of the atypical heterogeneous subgraphs centered on them, which encapsulate the essential semantics and intricate relations between news elements. However, suffering from the heterogeneity, exploring such heterogeneous subgraphs remains an open problem. To bridge the gap, this work proposes a heterogeneous subgraph transformer (HeteroSGT) to exploit subgraphs in our constructed heterogeneous graph. In HeteroSGT, we first employ a pre-trained language model to derive both word-level and sentence-level semantics. Then the random walk with restart (RWR) is applied to extract subgraphs centered on each news, which are further fed to our proposed subgraph Transformer to quantify the authenticity. Extensive experiments on five real-world datasets demonstrate the superior performance of HeteroSGT over five baselines. Further case and ablation studies validate our motivation and demonstrate that performance improvement stems from our specially designed components.
- [1727] arXiv:2404.13194 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Privacy-Preserving Debiasing using Data Augmentation and Machine UnlearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Data augmentation is widely used to mitigate data bias in the training dataset. However, data augmentation exposes machine learning models to privacy attacks, such as membership inference attacks. In this paper, we propose an effective combination of data augmentation and machine unlearning, which can reduce data bias while providing a provable defense against known attacks. Specifically, we maintain the fairness of the trained model with diffusion-based data augmentation, and then utilize multi-shard unlearning to remove identifying information of original data from the ML model for protection against privacy attacks. Experimental evaluation across diverse datasets demonstrates that our approach can achieve significant improvements in bias reduction as well as robustness against state-of-the-art privacy attacks.
- [1728] arXiv:2404.13218 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: On the Temperature of Machine Learning SystemsComments: 44 pages, 8 figures, 3 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: We develop a thermodynamic theory for machine learning (ML) systems. Similar to physical thermodynamic systems which are characterized by energy and entropy, ML systems possess these characteristics as well. This comparison inspire us to integrate the concept of temperature into ML systems grounded in the fundamental principles of thermodynamics, and establish a basic thermodynamic framework for machine learning systems with non-Boltzmann distributions. We introduce the concept of states within a ML system, identify two typical types of state, and interpret model training and refresh as a process of state phase transition. We consider that the initial potential energy of a ML system is described by the model's loss functions, and the energy adheres to the principle of minimum potential energy. For a variety of energy forms and parameter initialization methods, we derive the temperature of systems during the phase transition both analytically and asymptotically, highlighting temperature as a vital indicator of system data distribution and ML training complexity. Moreover, we perceive deep neural networks as complex heat engines with both global temperature and local temperatures in each layer. The concept of work efficiency is introduced within neural networks, which mainly depends on the neural activation functions. We then classify neural networks based on their work efficiency, and describe neural networks as two types of heat engines.
- [1729] arXiv:2404.13238 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Personalized Wireless Federated Learning for Large Language ModelsComments: 8 pages, 5 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large Language Models (LLMs) have revolutionized natural language processing tasks. However, their deployment in wireless networks still face challenges, i.e., a lack of privacy and security protection mechanisms. Federated Learning (FL) has emerged as a promising approach to address these challenges. Yet, it suffers from issues including inefficient handling with big and heterogeneous data, resource-intensive training, and high communication overhead. To tackle these issues, we first compare different learning stages and their features of LLMs in wireless networks. Next, we introduce two personalized wireless federated fine-tuning methods with low communication overhead, i.e., (1) Personalized Federated Instruction Tuning (PFIT), which employs reinforcement learning to fine-tune local LLMs with diverse reward models to achieve personalization; (2) Personalized Federated Task Tuning (PFTT), which can leverage global adapters and local Low-Rank Adaptations (LoRA) to collaboratively fine-tune local LLMs, where the local LoRAs can be applied to achieve personalization without aggregation. Finally, we perform simulations to demonstrate the effectiveness of the proposed two methods and comprehensively discuss open issues.
- [1730] arXiv:2404.13265 (cross-list from q-bio.GN) [ pdf , ps , other ]
-
Title: F5C-finder: An Explainable and Ensemble Biological Language Model for Predicting 5-Formylcytidine Modifications on mRNAComments: 34 pages, 10 figures, journalSubjects: Genomics (q-bio.GN) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: As a prevalent and dynamically regulated epigenetic modification, 5-formylcytidine (f5C) is crucial in various biological processes. However, traditional experimental methods for f5C detection are often laborious and time-consuming, limiting their ability to map f5C sites across the transcriptome comprehensively. While computational approaches offer a cost-effective and high-throughput alternative, no recognition model for f5C has been developed to date. Drawing inspiration from language models in natural language processing, this study presents f5C-finder, an ensemble neural network-based model utilizing multi-head attention for the identification of f5C. Five distinct feature extraction methods were employed to construct five individual artificial neural networks, and these networks were subsequently integrated through ensemble learning to create f5C-finder. 10-fold cross-validation and independent tests demonstrate that f5C-finder achieves state-of-the-art (SOTA) performance with AUC of 0.807 and 0.827, respectively. The result highlights the effectiveness of biological language model in capturing both the order (sequential) and functional meaning (semantics) within genomes. Furthermore, the built-in interpretability allows us to understand what the model is learning, creating a bridge between identifying key sequential elements and a deeper exploration of their biological functions.
- [1731] arXiv:2404.13274 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Augmented Object Intelligence: Making the Analog World Interactable with XR-ObjectsMustafa Doga Dogan , Eric J. Gonzalez , Andrea Colaco , Karan Ahuja , Ruofei Du , Johnny Lee , Mar Gonzalez-Franco , David KimSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Seamless integration of physical objects as interactive digital entities remains a challenge for spatial computing. This paper introduces Augmented Object Intelligence (AOI), a novel XR interaction paradigm designed to blur the lines between digital and physical by equipping real-world objects with the ability to interact as if they were digital, where every object has the potential to serve as a portal to vast digital functionalities. Our approach utilizes object segmentation and classification, combined with the power of Multimodal Large Language Models (MLLMs), to facilitate these interactions. We implement the AOI concept in the form of XR-Objects, an open-source prototype system that provides a platform for users to engage with their physical environment in rich and contextually relevant ways. This system enables analog objects to not only convey information but also to initiate digital actions, such as querying for details or executing tasks. Our contributions are threefold: (1) we define the AOI concept and detail its advantages over traditional AI assistants, (2) detail the XR-Objects system's open-source design and implementation, and (3) show its versatility through a variety of use cases and a user study.
- [1732] arXiv:2404.13278 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Federated Transfer Learning with Task Personalization for Condition Monitoring in Ultrasonic Metal WeldingComments: 37 pages, 8 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP)
Abstract: Ultrasonic metal welding (UMW) is a key joining technology with widespread industrial applications. Condition monitoring (CM) capabilities are critically needed in UMW applications because process anomalies significantly deteriorate the joining quality. Recently, machine learning models emerged as a promising tool for CM in many manufacturing applications due to their ability to learn complex patterns. Yet, the successful deployment of these models requires substantial training data that may be expensive and time-consuming to collect. Additionally, many existing machine learning models lack generalizability and cannot be directly applied to new process configurations (i.e., domains). Such issues may be potentially alleviated by pooling data across manufacturers, but data sharing raises critical data privacy concerns. To address these challenges, this paper presents a Federated Transfer Learning with Task Personalization (FTL-TP) framework that provides domain generalization capabilities in distributed learning while ensuring data privacy. By effectively learning a unified representation from feature space, FTL-TP can adapt CM models for clients working on similar tasks, thereby enhancing their overall adaptability and performance jointly. To demonstrate the effectiveness of FTL-TP, we investigate two distinct UMW CM tasks, tool condition monitoring and workpiece surface condition classification. Compared with state-of-the-art FL algorithms, FTL-TP achieves a 5.35%--8.08% improvement of accuracy in CM in new target domains. FTL-TP is also shown to perform excellently in challenging scenarios involving unbalanced data distributions and limited client fractions. Furthermore, by implementing the FTL-TP method on an edge-cloud architecture, we show that this method is both viable and efficient in practice. The FTL-TP framework is readily extensible to various other manufacturing applications.
- [1733] arXiv:2404.13292 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization ChallengeKhuyagbaatar Batsuren , Ekaterina Vylomova , Verna Dankers , Tsetsuukhei Delgerbaatar , Omri Uzan , Yuval Pinter , Gábor BellaSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms have been proposed, their evaluation and cross-comparison is still an open problem. As a solution, we propose a combined intrinsic-extrinsic evaluation framework for subword tokenization. Intrinsic evaluation is based on our new UniMorph Labeller tool that classifies subword tokenization as either morphological or alien. Extrinsic evaluation, in turn, is performed via the Out-of-Vocabulary Generalization Challenge 1.0 benchmark, which consists of three newly specified downstream text classification tasks. Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalizations compared to morphological tokenization for semantic compositionality of word meanings.
- [1734] arXiv:2404.13320 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Pixel is a Barrier: Diffusion Models Are More Adversarially Robust Than We ThinkSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Adversarial examples for diffusion models are widely used as solutions for safety concerns. By adding adversarial perturbations to personal images, attackers can not edit or imitate them easily. However, it is essential to note that all these protections target the latent diffusion model (LDMs), the adversarial examples for diffusion models in the pixel space (PDMs) are largely overlooked. This may mislead us to think that the diffusion models are vulnerable to adversarial attacks like most deep models. In this paper, we show novel findings that: even though gradient-based white-box attacks can be used to attack the LDMs, they fail to attack PDMs. This finding is supported by extensive experiments of almost a wide range of attacking methods on various PDMs and LDMs with different model structures, which means diffusion models are indeed much more robust against adversarial attacks. We also find that PDMs can be used as an off-the-shelf purifier to effectively remove the adversarial patterns that were generated on LDMs to protect the images, which means that most protection methods nowadays, to some extent, cannot protect our images from malicious attacks. We hope that our insights will inspire the community to rethink the adversarial samples for diffusion models as protection methods and move forward to more effective protection. Codes are available in this https URL .
- [1735] arXiv:2404.13322 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: MergeNet: Knowledge Migration across Heterogeneous Models, Tasks, and ModalitiesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In this study, we focus on heterogeneous knowledge transfer across entirely different model architectures, tasks, and modalities. Existing knowledge transfer methods (e.g., backbone sharing, knowledge distillation) often hinge on shared elements within model structures or task-specific features/labels, limiting transfers to complex model types or tasks. To overcome these challenges, we present MergeNet, which learns to bridge the gap of parameter spaces of heterogeneous models, facilitating the direct interaction, extraction, and application of knowledge within these parameter spaces. The core mechanism of MergeNet lies in the parameter adapter, which operates by querying the source model's low-rank parameters and adeptly learning to identify and map parameters into the target model. MergeNet is learned alongside both models, allowing our framework to dynamically transfer and adapt knowledge relevant to the current stage, including the training trajectory knowledge of the source model. Extensive experiments on heterogeneous knowledge transfer demonstrate significant improvements in challenging settings, where representative approaches may falter or prove less applicable.
- [1736] arXiv:2404.13327 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Comparative Analysis on Snowmelt-Driven Streamflow Forecasting Using Machine Learning TechniquesComments: 17 pages, 4 Tables, 7 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The rapid advancement of machine learning techniques has led to their widespread application in various domains including water resources. However, snowmelt modeling remains an area that has not been extensively explored. In this study, we propose a state-of-the-art (SOTA) deep learning sequential model, leveraging the Temporal Convolutional Network (TCN), for snowmelt-driven discharge modeling in the Himalayan basin of the Hindu Kush Himalayan Region. To evaluate the performance of our proposed model, we conducted a comparative analysis with other popular models including Support Vector Regression (SVR), Long Short Term Memory (LSTM), and Transformer. Furthermore, Nested cross-validation (CV) is used with five outer folds and three inner folds, and hyper-parameter tuning is performed on the inner folds. To evaluate the performance of the model mean absolute error (MAE), root mean square error (RMSE), R square ($R^{2}$), Kling-Gupta Efficiency (KGE), and Nash-Sutcliffe Efficiency (NSE) are computed for each outer fold. The average metrics revealed that TCN outperformed the other models, with an average MAE of 0.011, RMSE of 0.023, $R^{2}$ of 0.991, KGE of 0.992, and NSE of 0.991. The findings of this study demonstrate the effectiveness of the deep learning model as compared to traditional machine learning approaches for snowmelt-driven streamflow forecasting. Moreover, the superior performance of TCN highlights its potential as a promising deep learning model for similar hydrological applications.
- [1737] arXiv:2404.13340 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Large Language Models as Test Case Generators: Performance Evaluation and EnhancementSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Code generation with Large Language Models (LLMs) has been extensively studied and achieved remarkable progress. As a complementary aspect to code generation, test case generation is of crucial importance in ensuring the quality and reliability of code. However, using LLMs as test case generators has been much less explored. Current research along this line primarily focuses on enhancing code generation with assistance from test cases generated by LLMs, while the performance of LLMs in test case generation alone has not been comprehensively examined. To bridge this gap, we conduct extensive experiments to study how well LLMs can generate high-quality test cases. We find that as the problem difficulty increases, state-of-the-art LLMs struggle to generate correct test cases, largely due to their inherent limitations in computation and reasoning. To mitigate this issue, we further propose a multi-agent framework called \emph{TestChain} that decouples the generation of test inputs and test outputs. Notably, TestChain uses a ReAct format conversation chain for LLMs to interact with a Python interpreter in order to provide more accurate test outputs. Our results indicate that TestChain outperforms the baseline by a large margin. Particularly, in terms of the accuracy of test cases, TestChain using GPT-4 as the backbone achieves a 13.84\% improvement over the baseline on the LeetCode-hard dataset.
- [1738] arXiv:2404.13343 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: UnibucLLM: Harnessing LLMs for Automated Prediction of Item Difficulty and Response Time for Multiple-Choice QuestionsComments: Accepted at BEA 2024 (NAACL Workshop)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This work explores a novel data augmentation method based on Large Language Models (LLMs) for predicting item difficulty and response time of retired USMLE Multiple-Choice Questions (MCQs) in the BEA 2024 Shared Task. Our approach is based on augmenting the dataset with answers from zero-shot LLMs (Falcon, Meditron, Mistral) and employing transformer-based models based on six alternative feature combinations. The results suggest that predicting the difficulty of questions is more challenging. Notably, our top performing methods consistently include the question text, and benefit from the variability of LLM answers, highlighting the potential of LLMs for improving automated assessment in medical licensing exams. We make our code available this https URL .
- [1739] arXiv:2404.13344 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: GRANOLA: Adaptive Normalization for Graph Neural NetworksSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In recent years, significant efforts have been made to refine the design of Graph Neural Network (GNN) layers, aiming to overcome diverse challenges, such as limited expressive power and oversmoothing. Despite their widespread adoption, the incorporation of off-the-shelf normalization layers like BatchNorm or InstanceNorm within a GNN architecture may not effectively capture the unique characteristics of graph-structured data, potentially reducing the expressive power of the overall architecture. Moreover, existing graph-specific normalization layers often struggle to offer substantial and consistent benefits. In this paper, we propose GRANOLA, a novel graph-adaptive normalization layer. Unlike existing normalization layers, GRANOLA normalizes node features by adapting to the specific characteristics of the graph, particularly by generating expressive representations of its neighborhood structure, obtained by leveraging the propagation of Random Node Features (RNF) in the graph. We present theoretical results that support our design choices. Our extensive empirical evaluation of various graph benchmarks underscores the superior performance of GRANOLA over existing normalization techniques. Furthermore, GRANOLA emerges as the top-performing method among all baselines within the same time complexity of Message Passing Neural Networks (MPNNs).
- [1740] arXiv:2404.13347 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Augmenting Safety-Critical Driving Scenarios while Preserving Similarity to Expert TrajectoriesComments: Accepted to 35th IEEE Intelligent Vehicles Symposium, 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Trajectory augmentation serves as a means to mitigate distributional shift in imitation learning. However, imitating trajectories that inadequately represent the original expert data can result in undesirable behaviors, particularly in safety-critical scenarios. We propose a trajectory augmentation method designed to maintain similarity with expert trajectory data. To accomplish this, we first cluster trajectories to identify minority yet safety-critical groups. Then, we combine the trajectories within the same cluster through geometrical transformation to create new trajectories. These trajectories are then added to the training dataset, provided that they meet our specified safety-related criteria. Our experiments exhibit that training an imitation learning model using these augmented trajectories can significantly improve closed-loop performance.
- [1741] arXiv:2404.13358 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Music Consistency ModelsSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Consistency models have exhibited remarkable capabilities in facilitating efficient image/video generation, enabling synthesis with minimal sampling steps. It has proven to be advantageous in mitigating the computational burdens associated with diffusion models. Nevertheless, the application of consistency models in music generation remains largely unexplored. To address this gap, we present Music Consistency Models (\texttt{MusicCM}), which leverages the concept of consistency models to efficiently synthesize mel-spectrogram for music clips, maintaining high quality while minimizing the number of sampling steps. Building upon existing text-to-music diffusion models, the \texttt{MusicCM} model incorporates consistency distillation and adversarial discriminator training. Moreover, we find it beneficial to generate extended coherent music by incorporating multiple diffusion processes with shared constraints. Experimental results reveal the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness. Notable, \texttt{MusicCM} achieves seamless music synthesis with a mere four sampling steps, e.g., only one second per minute of the music clip, showcasing the potential for real-time application.
- [1742] arXiv:2404.13362 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Semantically Corrected Amharic Automatic Speech RecognitionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Abstract: Automatic Speech Recognition (ASR) can play a crucial role in enhancing the accessibility of spoken languages worldwide. In this paper, we build a set of ASR tools for Amharic, a language spoken by more than 50 million people primarily in eastern Africa. Amharic is written in the Ge'ez script, a sequence of graphemes with spacings denoting word boundaries. This makes computational processing of Amharic challenging since the location of spacings can significantly impact the meaning of formed sentences. We find that existing benchmarks for Amharic ASR do not account for these spacings and only measure individual grapheme error rates, leading to significantly inflated measurements of in-the-wild performance. In this paper, we first release corrected transcriptions of existing Amharic ASR test datasets, enabling the community to accurately evaluate progress. Furthermore, we introduce a post-processing approach using a transformer encoder-decoder architecture to organize raw ASR outputs into a grammatically complete and semantically meaningful Amharic sentence. Through experiments on the corrected test dataset, our model enhances the semantic correctness of Amharic speech recognition systems, achieving a Character Error Rate (CER) of 5.5\% and a Word Error Rate (WER) of 23.3\%.
- [1743] arXiv:2404.13397 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Retrieval-Augmented Generation-based Relation ExtractionComments: Submitted to Semantic Web Journal. Under ReviewSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Information Extraction (IE) is a transformative process that converts unstructured text data into a structured format by employing entity and relation extraction (RE) methodologies. The identification of the relation between a pair of entities plays a crucial role within this framework. Despite the existence of various techniques for relation extraction, their efficacy heavily relies on access to labeled data and substantial computational resources. In addressing these challenges, Large Language Models (LLMs) emerge as promising solutions; however, they might return hallucinating responses due to their own training data. To overcome these limitations, Retrieved-Augmented Generation-based Relation Extraction (RAG4RE) in this work is proposed, offering a pathway to enhance the performance of relation extraction tasks.
This work evaluated the effectiveness of our RAG4RE approach utilizing different LLMs. Through the utilization of established benchmarks, such as TACRED, TACREV, Re-TACRED, and SemEval RE datasets, our aim is to comprehensively evaluate the efficacy of our RAG4RE approach. In particularly, we leverage prominent LLMs including Flan T5, Llama2, and Mistral in our investigation. The results of our study demonstrate that our RAG4RE approach surpasses performance of traditional RE approaches based solely on LLMs, particularly evident in the TACRED dataset and its variations. Furthermore, our approach exhibits remarkable performance compared to previous RE methodologies across both TACRED and TACREV datasets, underscoring its efficacy and potential for advancing RE tasks in natural language processing. - [1744] arXiv:2404.13402 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Intrusion Detection at Scale with the Assistance of a Command-line Language ModelComments: Accepted by IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), industry trackSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Intrusion detection is a long standing and crucial problem in security. A system capable of detecting intrusions automatically is on great demand in enterprise security solutions. Existing solutions rely heavily on hand-crafted rules designed by security operators, which suffer from high false negative rates and poor generalization ability to new, zero-day attacks at scale. AI and machine learning offer promising solutions to address the issues, by inspecting abnormal user behaviors intelligently and automatically from data. However, existing learning-based intrusion detection systems in the literature are mostly designed for small data, and they lack the ability to leverage the power of big data in cloud environments. In this paper, we target at this problem and introduce an intrusion detection system which incorporates large-scale pre-training, so as to train a large language model based on tens of millions of command lines for AI-based intrusion detection. Experiments performed on 30 million training samples and 10 million test samples verify the effectiveness of our solution.
- [1745] arXiv:2404.13417 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Efficient and Concise Explanations for Object Detection with Gaussian-Class Activation Mapping ExplainerQuoc Khanh Nguyen , Truong Thanh Hung Nguyen , Vo Thanh Khang Nguyen , Van Binh Truong , Tuong Phan , Hung CaoComments: Canadian AI 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: To address the challenges of providing quick and plausible explanations in Explainable AI (XAI) for object detection models, we introduce the Gaussian Class Activation Mapping Explainer (G-CAME). Our method efficiently generates concise saliency maps by utilizing activation maps from selected layers and applying a Gaussian kernel to emphasize critical image regions for the predicted object. Compared with other Region-based approaches, G-CAME significantly reduces explanation time to 0.5 seconds without compromising the quality. Our evaluation of G-CAME, using Faster-RCNN and YOLOX on the MS-COCO 2017 dataset, demonstrates its ability to offer highly plausible and faithful explanations, especially in reducing the bias on tiny object detection.
- [1746] arXiv:2404.13421 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: MultiConfederated Learning: Inclusive Non-IID Data handling with Decentralized Federated LearningJournal-ref: Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, SAC '24, 1587-1595, April 2024. ACMSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Federated Learning (FL) has emerged as a prominent privacy-preserving technique for enabling use cases like confidential clinical machine learning. FL operates by aggregating models trained by remote devices which owns the data. Thus, FL enables the training of powerful global models using crowd-sourced data from a large number of learners, without compromising their privacy. However, the aggregating server is a single point of failure when generating the global model. Moreover, the performance of the model suffers when the data is not independent and identically distributed (non-IID data) on all remote devices. This leads to vastly different models being aggregated, which can reduce the performance by as much as 50% in certain scenarios.
In this paper, we seek to address the aforementioned issues while retaining the benefits of FL. We propose MultiConfederated Learning: a decentralized FL framework which is designed to handle non-IID data. Unlike traditional FL, MultiConfederated Learning will maintain multiple models in parallel (instead of a single global model) to help with convergence when the data is non-IID. With the help of transfer learning, learners can converge to fewer models. In order to increase adaptability, learners are allowed to choose which updates to aggregate from their peers. - [1747] arXiv:2404.13425 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AdvLoRA: Adversarial Low-Rank Adaptation of Vision-Language ModelsYuheng Ji , Yue Liu , Zhicheng Zhang , Zhao Zhang , Yuting Zhao , Gang Zhou , Xingwei Zhang , Xinwang Liu , Xiaolong ZhengSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Vision-Language Models (VLMs) are a significant technique for Artificial General Intelligence (AGI). With the fast growth of AGI, the security problem become one of the most important challenges for VLMs. In this paper, through extensive experiments, we demonstrate the vulnerability of the conventional adaptation methods for VLMs, which may bring significant security risks. In addition, as the size of the VLMs increases, performing conventional adversarial adaptation techniques on VLMs results in high computational costs. To solve these problems, we propose a parameter-efficient \underline{Adv}ersarial adaptation method named \underline{AdvLoRA} by \underline{Lo}w-\underline{R}ank \underline{A}daptation. At first, we investigate and reveal the intrinsic low-rank property during the adversarial adaptation for VLMs. Different from LoRA, we improve the efficiency and robustness of adversarial adaptation by designing a novel reparameterizing method based on parameter clustering and parameter alignment. In addition, an adaptive parameter update strategy is proposed to further improve the robustness. By these settings, our proposed AdvLoRA alleviates the model security and high resource waste problems. Extensive experiments demonstrate the effectiveness and efficiency of the AdvLoRA.
- [1748] arXiv:2404.13428 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Text-dependent Speaker Verification (TdSV) Challenge 2024: Challenge Evaluation PlanSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: This document outlines the Text-dependent Speaker Verification (TdSV) Challenge 2024, which centers on analyzing and exploring novel approaches for text-dependent speaker verification. The primary goal of this challenge is to motive participants to develop single yet competitive systems, conduct thorough analyses, and explore innovative concepts such as multi-task learning, self-supervised learning, few-shot learning, and others, for text-dependent speaker verification.
- [1749] arXiv:2404.13434 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Nested-TNT: Hierarchical Vision Transformers with Multi-Scale Feature ProcessingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Transformer has been applied in the field of computer vision due to its excellent performance in natural language processing, surpassing traditional convolutional neural networks and achieving new state-of-the-art. ViT divides an image into several local patches, known as "visual sentences". However, the information contained in the image is vast and complex, and focusing only on the features at the "visual sentence" level is not enough. The features between local patches should also be taken into consideration. In order to achieve further improvement, the TNT model is proposed, whose algorithm further divides the image into smaller patches, namely "visual words," achieving more accurate results. The core of Transformer is the Multi-Head Attention mechanism, and traditional attention mechanisms ignore interactions across different attention heads. In order to reduce redundancy and improve utilization, we introduce the nested algorithm and apply the Nested-TNT to image classification tasks. The experiment confirms that the proposed model has achieved better classification performance over ViT and TNT, exceeding 2.25%, 1.1% on dataset CIFAR10 and 2.78%, 0.25% on dataset FLOWERS102 respectively.
- [1750] arXiv:2404.13449 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SiNC+: Adaptive Camera-Based Vitals with Unsupervised Learning of Periodic SignalsComments: Extension of CVPR2023 highlight paper. arXiv admin note: substantial text overlap with arXiv:2303.07944Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Subtle periodic signals, such as blood volume pulse and respiration, can be extracted from RGB video, enabling noncontact health monitoring at low cost. Advancements in remote pulse estimation -- or remote photoplethysmography (rPPG) -- are currently driven by deep learning solutions. However, modern approaches are trained and evaluated on benchmark datasets with ground truth from contact-PPG sensors. We present the first non-contrastive unsupervised learning framework for signal regression to mitigate the need for labelled video data. With minimal assumptions of periodicity and finite bandwidth, our approach discovers the blood volume pulse directly from unlabelled videos. We find that encouraging sparse power spectra within normal physiological bandlimits and variance over batches of power spectra is sufficient for learning visual features of periodic signals. We perform the first experiments utilizing unlabelled video data not specifically created for rPPG to train robust pulse rate estimators. Given the limited inductive biases, we successfully applied the same approach to camera-based respiration by changing the bandlimits of the target signal. This shows that the approach is general enough for unsupervised learning of bandlimited quasi-periodic signals from different domains. Furthermore, we show that the framework is effective for finetuning models on unlabelled video from a single subject, allowing for personalized and adaptive signal regressors.
- [1751] arXiv:2404.13470 (cross-list from cs.DC) [ pdf , ps , html , other ]
-
Title: GWLZ: A Group-wise Learning-based Lossy Compression Framework for Scientific DataSubjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI)
Abstract: The rapid expansion of computational capabilities and the ever-growing scale of modern HPC systems present formidable challenges in managing exascale scientific data. Faced with such vast datasets, traditional lossless compression techniques prove insufficient in reducing data size to a manageable level while preserving all information intact. In response, researchers have turned to error-bounded lossy compression methods, which offer a balance between data size reduction and information retention. However, despite their utility, these compressors employing conventional techniques struggle with limited reconstruction quality. To address this issue, we draw inspiration from recent advancements in deep learning and propose GWLZ, a novel group-wise learning-based lossy compression framework with multiple lightweight learnable enhancer models. Leveraging a group of neural networks, GWLZ significantly enhances the decompressed data reconstruction quality with negligible impact on the compression efficiency. Experimental results on different fields from the Nyx dataset demonstrate remarkable improvements by GWLZ, achieving up to 20% quality enhancements with negligible overhead as low as 0.0003x.
- [1752] arXiv:2404.13474 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Composing Pre-Trained Object-Centric Representations for Robotics From "What" and "Where" Foundation ModelsComments: ICRA 2024. Project website: this https URLSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: There have recently been large advances both in pre-training visual representations for robotic control and segmenting unknown category objects in general images. To leverage these for improved robot learning, we propose $\textbf{POCR}$, a new framework for building pre-trained object-centric representations for robotic control. Building on theories of "what-where" representations in psychology and computer vision, we use segmentations from a pre-trained model to stably locate across timesteps, various entities in the scene, capturing "where" information. To each such segmented entity, we apply other pre-trained models that build vector descriptions suitable for robotic control tasks, thus capturing "what" the entity is. Thus, our pre-trained object-centric representations for control are constructed by appropriately combining the outputs of off-the-shelf pre-trained models, with no new training. On various simulated and real robotic tasks, we show that imitation policies for robotic manipulators trained on POCR achieve better performance and systematic generalization than state of the art pre-trained representations for robotics, as well as prior object-centric representations that are typically trained from scratch.
- [1753] arXiv:2404.13475 (cross-list from quant-ph) [ pdf , ps , html , other ]
-
Title: PristiQ: A Co-Design Framework for Preserving Data Security of Quantum Learning in the CloudSubjects: Quantum Physics (quant-ph) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Abstract: Benefiting from cloud computing, today's early-stage quantum computers can be remotely accessed via the cloud services, known as Quantum-as-a-Service (QaaS). However, it poses a high risk of data leakage in quantum machine learning (QML). To run a QML model with QaaS, users need to locally compile their quantum circuits including the subcircuit of data encoding first and then send the compiled circuit to the QaaS provider for execution. If the QaaS provider is untrustworthy, the subcircuit to encode the raw data can be easily stolen. Therefore, we propose a co-design framework for preserving the data security of QML with the QaaS paradigm, namely PristiQ. By introducing an encryption subcircuit with extra secure qubits associated with a user-defined security key, the security of data can be greatly enhanced. And an automatic search algorithm is proposed to optimize the model to maintain its performance on the encrypted quantum data. Experimental results on simulation and the actual IBM quantum computer both prove the ability of PristiQ to provide high security for the quantum data while maintaining the model performance in QML.
- [1754] arXiv:2404.13476 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Framework for Feasible Counterfactual Exploration incorporating Causality, Sparsity and DensityComments: 10 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The imminent need to interpret the output of a Machine Learning model with counterfactual (CF) explanations - via small perturbations to the input - has been notable in the research community. Although the variety of CF examples is important, the aspect of them being feasible at the same time, does not necessarily apply in their entirety. This work uses different benchmark datasets to examine through the preservation of the logical causal relations of their attributes, whether CF examples can be generated after a small amount of changes to the original input, be feasible and actually useful to the end-user in a real-world case. To achieve this, we used a black box model as a classifier, to distinguish the desired from the input class and a Variational Autoencoder (VAE) to generate feasible CF examples. As an extension, we also extracted two-dimensional manifolds (one for each dataset) that located the majority of the feasible examples, a representation that adequately distinguished them from infeasible ones. For our experimentation we used three commonly used datasets and we managed to generate feasible and at the same time sparse, CF examples that satisfy all possible predefined causal constraints, by confirming their importance with the attributes in a dataset.
- [1755] arXiv:2404.13496 (cross-list from math.NA) [ pdf , ps , html , other ]
-
Title: ODE-DPS: ODE-based Diffusion Posterior Sampling for Inverse Problems in Partial Differential EquationSubjects: Numerical Analysis (math.NA) ; Artificial Intelligence (cs.AI)
Abstract: In recent years we have witnessed a growth in mathematics for deep learning, which has been used to solve inverse problems of partial differential equations (PDEs). However, most deep learning-based inversion methods either require paired data or necessitate retraining neural networks for modifications in the conditions of the inverse problem, significantly reducing the efficiency of inversion and limiting its applicability. To overcome this challenge, in this paper, leveraging the score-based generative diffusion model, we introduce a novel unsupervised inversion methodology tailored for solving inverse problems arising from PDEs. Our approach operates within the Bayesian inversion framework, treating the task of solving the posterior distribution as a conditional generation process achieved through solving a reverse-time stochastic differential equation. Furthermore, to enhance the accuracy of inversion results, we propose an ODE-based Diffusion Posterior Sampling inversion algorithm. The algorithm stems from the marginal probability density functions of two distinct forward generation processes that satisfy the same Fokker-Planck equation. Through a series of experiments involving various PDEs, we showcase the efficiency and robustness of our proposed method.
- [1756] arXiv:2404.13500 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Generalized Regression with Conditional GANsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Regression is typically treated as a curve-fitting process where the goal is to fit a prediction function to data. With the help of conditional generative adversarial networks, we propose to solve this age-old problem in a different way; we aim to learn a prediction function whose outputs, when paired with the corresponding inputs, are indistinguishable from feature-label pairs in the training dataset. We show that this approach to regression makes fewer assumptions on the distribution of the data we are fitting to and, therefore, has better representation capabilities. We draw parallels with generalized linear models in statistics and show how our proposal serves as an extension of them to neural networks. We demonstrate the superiority of this new approach to standard regression with experiments on multiple synthetic and publicly available real-world datasets, finding encouraging results, especially with real-world heavy-tailed regression datasets. To make our work more reproducible, we release our source code. Link to repository: https://anonymous.4open.science/r/regressGAN-7B71/
- [1757] arXiv:2404.13506 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Parameter Efficient Fine Tuning: A Comprehensive Analysis Across ApplicationsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The rise of deep learning has marked significant progress in fields such as computer vision, natural language processing, and medical imaging, primarily through the adaptation of pre-trained models for specific tasks. Traditional fine-tuning methods, involving adjustments to all parameters, face challenges due to high computational and memory demands. This has led to the development of Parameter Efficient Fine-Tuning (PEFT) techniques, which selectively update parameters to balance computational efficiency with performance. This review examines PEFT approaches, offering a detailed comparison of various strategies highlighting applications across different domains, including text generation, medical imaging, protein modeling, and speech synthesis. By assessing the effectiveness of PEFT methods in reducing computational load, speeding up training, and lowering memory usage, this paper contributes to making deep learning more accessible and adaptable, facilitating its wider application and encouraging innovation in model optimization. Ultimately, the paper aims to contribute towards insights into PEFT's evolving landscape, guiding researchers and practitioners in overcoming the limitations of conventional fine-tuning approaches.
- [1758] arXiv:2404.13509 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: MFHCA: Enhancing Speech Emotion Recognition Via Multi-Spatial Fusion and Hierarchical Cooperative AttentionComments: Main paper (5 pages). Accepted for publication by ICME 2024Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Speech emotion recognition is crucial in human-computer interaction, but extracting and using emotional cues from audio poses challenges. This paper introduces MFHCA, a novel method for Speech Emotion Recognition using Multi-Spatial Fusion and Hierarchical Cooperative Attention on spectrograms and raw audio. We employ the Multi-Spatial Fusion module (MF) to efficiently identify emotion-related spectrogram regions and integrate Hubert features for higher-level acoustic information. Our approach also includes a Hierarchical Cooperative Attention module (HCA) to merge features from various auditory levels. We evaluate our method on the IEMOCAP dataset and achieve 2.6\% and 1.87\% improvements on the weighted accuracy and unweighted accuracy, respectively. Extensive experiments demonstrate the effectiveness of the proposed method.
- [1759] arXiv:2404.13515 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FedTrans: Efficient Federated Learning via Multi-Model TransformationJournal-ref: MLSys (2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Federated learning (FL) aims to train machine learning (ML) models across potentially millions of edge client devices. Yet, training and customizing models for FL clients is notoriously challenging due to the heterogeneity of client data, device capabilities, and the massive scale of clients, making individualized model exploration prohibitively expensive. State-of-the-art FL solutions personalize a globally trained model or concurrently train multiple models, but they often incur suboptimal model accuracy and huge training costs.
In this paper, we introduce FedTrans, a multi-model FL training framework that automatically produces and trains high-accuracy, hardware-compatible models for individual clients at scale. FedTrans begins with a basic global model, identifies accuracy bottlenecks in model architectures during training, and then employs model transformation to derive new models for heterogeneous clients on the fly. It judiciously assigns models to individual clients while performing soft aggregation on multi-model updates to minimize total training costs. Our evaluations using realistic settings show that FedTrans improves individual client model accuracy by 14% - 72% while slashing training costs by 1.6X - 20X over state-of-the-art solutions. - [1760] arXiv:2404.13518 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Reliable Model Watermarking: Defending Against Theft without Compromising on EvasionSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: With the rise of Machine Learning as a Service (MLaaS) platforms,safeguarding the intellectual property of deep learning models is becoming paramount. Among various protective measures, trigger set watermarking has emerged as a flexible and effective strategy for preventing unauthorized model distribution. However, this paper identifies an inherent flaw in the current paradigm of trigger set watermarking: evasion adversaries can readily exploit the shortcuts created by models memorizing watermark samples that deviate from the main task distribution, significantly impairing their generalization in adversarial settings. To counteract this, we leverage diffusion models to synthesize unrestricted adversarial examples as trigger sets. By learning the model to accurately recognize them, unique watermark behaviors are promoted through knowledge injection rather than error memorization, thus avoiding exploitable shortcuts. Furthermore, we uncover that the resistance of current trigger set watermarking against removal attacks primarily relies on significantly damaging the decision boundaries during embedding, intertwining unremovability with adverse impacts. By optimizing the knowledge transfer properties of protected models, our approach conveys watermark behaviors to extraction surrogates without aggressively decision boundary perturbation. Experimental results on CIFAR-10/100 and Imagenette datasets demonstrate the effectiveness of our method, showing not only improved robustness against evasion adversaries but also superior resistance to watermark removal attacks compared to state-of-the-art solutions.
- [1761] arXiv:2404.13521 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Graph4GUI: Graph Neural Networks for Representing Graphical User InterfacesComments: 18 pagesSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Present-day graphical user interfaces (GUIs) exhibit diverse arrangements of text, graphics, and interactive elements such as buttons and menus, but representations of GUIs have not kept up. They do not encapsulate both semantic and visuo-spatial relationships among elements. To seize machine learning's potential for GUIs more efficiently, Graph4GUI exploits graph neural networks to capture individual elements' properties and their semantic-visuo-spatial constraints in a layout. The learned representation demonstrated its effectiveness in multiple tasks, especially generating designs in a challenging GUI autocompletion task, which involved predicting the positions of remaining unplaced elements in a partially completed GUI. The new model's suggestions showed alignment and visual appeal superior to the baseline method and received higher subjective ratings for preference. Furthermore, we demonstrate the practical benefits and efficiency advantages designers perceive when utilizing our model as an autocompletion plug-in.
- [1762] arXiv:2404.13528 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on MobileWei Niu , Md Musfiqur Rahman Sanim , Zhihao Shu , Jiexiong Guan , Xipeng Shen , Miao Yin , Gagan Agrawal , Bin RenSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: This work is motivated by recent developments in Deep Neural Networks, particularly the Transformer architectures underlying applications such as ChatGPT, and the need for performing inference on mobile devices. Focusing on emerging transformers (specifically the ones with computationally efficient Swin-like architectures) and large models (e.g., Stable Diffusion and LLMs) based on transformers, we observe that layout transformations between the computational operators cause a significant slowdown in these applications. This paper presents SmartMem, a comprehensive framework for eliminating most layout transformations, with the idea that multiple operators can use the same tensor layout through careful choice of layout and implementation of operations. Our approach is based on classifying the operators into four groups, and considering combinations of producer-consumer edges between the operators. We develop a set of methods for searching such layouts. Another component of our work is developing efficient memory layouts for 2.5 dimensional memory commonly seen in mobile devices. Our experimental results show that SmartMem outperforms 5 state-of-the-art DNN execution frameworks on mobile devices across 18 varied neural networks, including CNNs, Transformers with both local and global attention, as well as LLMs. In particular, compared to DNNFusion, SmartMem achieves an average speedup of 2.8$\times$, and outperforms TVM and MNN with speedups of 6.9$\times$ and 7.9$\times$, respectively, on average.
- [1763] arXiv:2404.13555 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Cell Phone Image-Based Persian Rice Detection and Classification Using Deep Learning TechniquesComments: 7 pages, 4 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This study introduces an innovative approach to classifying various types of Persian rice using image-based deep learning techniques, highlighting the practical application of everyday technology in food categorization. Recognizing the diversity of Persian rice and its culinary significance, we leveraged the capabilities of convolutional neural networks (CNNs), specifically by fine-tuning a ResNet model for accurate identification of different rice varieties and employing a U-Net architecture for precise segmentation of rice grains in bulk images. This dual-methodology framework allows for both individual grain classification and comprehensive analysis of bulk rice samples, addressing two crucial aspects of rice quality assessment. Utilizing images captured with consumer-grade cell phones reflects a realistic scenario in which individuals can leverage this technology for assistance with grocery shopping and meal preparation. The dataset, comprising various rice types photographed under natural conditions without professional lighting or equipment, presents a challenging yet practical classification problem. Our findings demonstrate the feasibility of using non-professional images for food classification and the potential of deep learning models, like ResNet and U-Net, to adapt to the nuances of everyday objects and textures. This study contributes to the field by providing insights into the applicability of image-based deep learning in daily life, specifically for enhancing consumer experiences and knowledge in food selection. Furthermore, it opens avenues for extending this approach to other food categories and practical applications, emphasizing the role of accessible technology in bridging the gap between sophisticated computational methods and everyday tasks.
- [1764] arXiv:2404.13564 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Masked Latent Transformer with the Random Masking Ratio to Advance the Diagnosis of Dental FluorosisSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Dental fluorosis is a chronic disease caused by long-term overconsumption of fluoride, which leads to changes in the appearance of tooth enamel. It is an important basis for early non-invasive diagnosis of endemic fluorosis. However, even dental professionals may not be able to accurately distinguish dental fluorosis and its severity based on tooth images. Currently, there is still a gap in research on applying deep learning to diagnosing dental fluorosis. Therefore, we construct the first open-source dental fluorosis image dataset (DFID), laying the foundation for deep learning research in this field. To advance the diagnosis of dental fluorosis, we propose a pioneering deep learning model called masked latent transformer with the random masking ratio (MLTrMR). MLTrMR introduces a mask latent modeling scheme based on Vision Transformer to enhance contextual learning of dental fluorosis lesion characteristics. Consisting of a latent embedder, encoder, and decoder, MLTrMR employs the latent embedder to extract latent tokens from the original image, whereas the encoder and decoder comprising the latent transformer (LT) block are used to process unmasked tokens and predict masked tokens, respectively. To mitigate the lack of inductive bias in Vision Transformer, which may result in performance degradation, the LT block introduces latent tokens to enhance the learning capacity of latent lesion features. Furthermore, we design an auxiliary loss function to constrain the parameter update direction of the model. MLTrMR achieves 80.19% accuracy, 75.79% F1, and 81.28% quadratic weighted kappa on DFID, making it state-of-the-art (SOTA).
- [1765] arXiv:2404.13565 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Exploring Diverse Methods in Visual Question AnsweringSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: This study explores innovative methods for improving Visual Question Answering (VQA) using Generative Adversarial Networks (GANs), autoencoders, and attention mechanisms. Leveraging a balanced VQA dataset, we investigate three distinct strategies. Firstly, GAN-based approaches aim to generate answer embeddings conditioned on image and question inputs, showing potential but struggling with more complex tasks. Secondly, autoencoder-based techniques focus on learning optimal embeddings for questions and images, achieving comparable results with GAN due to better ability on complex questions. Lastly, attention mechanisms, incorporating Multimodal Compact Bilinear pooling (MCB), address language priors and attention modeling, albeit with a complexity-performance trade-off. This study underscores the challenges and opportunities in VQA and suggests avenues for future research, including alternative GAN formulations and attentional mechanisms.
- [1766] arXiv:2404.13571 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Test-Time Training on Graphs with Large Language Models (LLMs)Jiaxin Zhang , Yiqi Wang , Xihong Yang , Siwei Wang , Yu Feng , Yu Shi , Ruicaho Ren , En Zhu , Xinwang LiuSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graph Neural Networks have demonstrated great success in various fields of multimedia. However, the distribution shift between the training and test data challenges the effectiveness of GNNs. To mitigate this challenge, Test-Time Training (TTT) has been proposed as a promising approach. Traditional TTT methods require a demanding unsupervised training strategy to capture the information from test to benefit the main task. Inspired by the great annotation ability of Large Language Models (LLMs) on Text-Attributed Graphs (TAGs), we propose to enhance the test-time training on graphs with LLMs as annotators. In this paper, we design a novel Test-Time Training pipeline, LLMTTT, which conducts the test-time adaptation under the annotations by LLMs on a carefully-selected node set. Specifically, LLMTTT introduces a hybrid active node selection strategy that considers not only node diversity and representativeness, but also prediction signals from the pre-trained model. Given annotations from LLMs, a two-stage training strategy is designed to tailor the test-time model with the limited and noisy labels. A theoretical analysis ensures the validity of our method and extensive experiments demonstrate that the proposed LLMTTT can achieve a significant performance improvement compared to existing Out-of-Distribution (OOD) generalization methods.
- [1767] arXiv:2404.13575 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: FedMPQ: Secure and Communication-Efficient Federated Learning with Multi-codebook Product QuantizationSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: In federated learning, particularly in cross-device scenarios, secure aggregation has recently gained popularity as it effectively defends against inference attacks by malicious aggregators. However, secure aggregation often requires additional communication overhead and can impede the convergence rate of the global model, which is particularly challenging in wireless network environments with extremely limited bandwidth. Therefore, achieving efficient communication compression under the premise of secure aggregation presents a highly challenging and valuable problem. In this work, we propose a novel uplink communication compression method for federated learning, named FedMPQ, which is based on multi shared codebook product quantization.Specifically, we utilize updates from the previous round to generate sufficiently robust codebooks. Secure aggregation is then achieved through trusted execution environments (TEE) or a trusted third party (TTP).In contrast to previous works, our approach exhibits greater robustness in scenarios where data is not independently and identically distributed (non-IID) and there is a lack of sufficient public data. The experiments conducted on the LEAF dataset demonstrate that our proposed method achieves 99% of the baseline's final accuracy, while reducing uplink communications by 90-95%
- [1768] arXiv:2404.13579 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LTOS: Layout-controllable Text-Object Synthesis via Adaptive Cross-attention FusionsXiaoran Zhao , Tianhao Wu , Yu Lai , Zhiliang Tian , Zhen Huang , Yahui Liu , Zejiang He , Dongsheng LiSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Controllable text-to-image generation synthesizes visual text and objects in images with certain conditions, which are frequently applied to emoji and poster generation. Visual text rendering and layout-to-image generation tasks have been popular in controllable text-to-image generation. However, each of these tasks typically focuses on single modality generation or rendering, leaving yet-to-be-bridged gaps between the approaches correspondingly designed for each of the tasks. In this paper, we combine text rendering and layout-to-image generation tasks into a single task: layout-controllable text-object synthesis (LTOS) task, aiming at synthesizing images with object and visual text based on predefined object layout and text contents. As compliant datasets are not readily available for our LTOS task, we construct a layout-aware text-object synthesis dataset, containing elaborate well-aligned labels of visual text and object information. Based on the dataset, we propose a layout-controllable text-object adaptive fusion (TOF) framework, which generates images with clear, legible visual text and plausible objects. We construct a visual-text rendering module to synthesize text and employ an object-layout control module to generate objects while integrating the two modules to harmoniously generate and integrate text content and objects in images. To better the image-text integration, we propose a self-adaptive cross-attention fusion module that helps the image generation to attend more to important text information. Within such a fusion module, we use a self-adaptive learnable factor to learn to flexibly control the influence of cross-attention outputs on image generation. Experimental results show that our method outperforms the state-of-the-art in LTOS, text rendering, and layout-to-image tasks, enabling harmonious visual text rendering and object generation.
- [1769] arXiv:2404.13588 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Machine Unlearning via Null Space CalibrationComments: Accepted by IJCAI-2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Machine unlearning aims to enable models to forget specific data instances when receiving deletion requests. Current research centres on efficient unlearning to erase the influence of data from the model and neglects the subsequent impacts on the remaining data. Consequently, existing unlearning algorithms degrade the model's performance after unlearning, known as \textit{over-unlearning}. This paper addresses this critical yet under-explored issue by introducing machine \underline{U}nlearning via \underline{N}ull \underline{S}pace \underline{C}alibration (UNSC), which can accurately unlearn target samples without over-unlearning. On the contrary, by calibrating the decision space during unlearning, UNSC can significantly improve the model's performance on the remaining samples. In particular, our approach hinges on confining the unlearning process to a specified null space tailored to the remaining samples, which is augmented by strategically pseudo-labeling the unlearning samples. Comparative analyses against several established baselines affirm the superiority of our approach. Code is released at this \href{ this https URL }{URL}.
- [1770] arXiv:2404.13594 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language ResamplersComments: NAACL 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: An effective method for combining frozen large language models (LLM) and visual encoders involves a resampler module that creates a `visual prompt' which is provided to the LLM, along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks like image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use \textit{diagnostic classifiers} to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability
- [1771] arXiv:2404.13604 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: CKGConv: General Graph Convolution with Continuous KernelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The existing definitions of graph convolution, either from spatial or spectral perspectives, are inflexible and not unified. Defining a general convolution operator in the graph domain is challenging due to the lack of canonical coordinates, the presence of irregular structures, and the properties of graph symmetries. In this work, we propose a novel graph convolution framework by parameterizing the kernels as continuous functions of pseudo-coordinates derived via graph positional encoding. We name this Continuous Kernel Graph Convolution (CKGConv). Theoretically, we demonstrate that CKGConv is flexible and expressive. CKGConv encompasses many existing graph convolutions, and exhibits the same expressiveness as graph transformers in terms of distinguishing non-isomorphic graphs. Empirically, we show that CKGConv-based Networks outperform existing graph convolutional networks and perform comparably to the best graph transformers across a variety of graph datasets.
- [1772] arXiv:2404.13627 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation SurroundingChunkit Chan , Cheng Jiayang , Yauwai Yim , Zheye Deng , Wei Fan , Haoran Li , Xin Liu , Hongming Zhang , Weiqi Wang , Yangqiu SongSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have sparked substantial interest and debate concerning their potential emergence of Theory of Mind (ToM) ability. Theory of mind evaluations currently focuses on testing models using machine-generated data or game settings prone to shortcuts and spurious correlations, which lacks evaluation of machine ToM ability in real-world human interaction scenarios. This poses a pressing demand to develop new real-world scenario benchmarks. We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation surrounding covered multi-dimensional mental states (i.e., desires, beliefs, and intentions). Our benchmark builds upon the Belief-Desire-Intention (BDI) agent modeling theory and conducts the necessary empirical experiments to evaluate large language models. Our findings demonstrate that NegotiationToM is challenging for state-of-the-art LLMs, as they consistently perform significantly worse than humans, even when employing the chain-of-thought (CoT) method.
- [1773] arXiv:2404.13630 (cross-list from cs.SE) [ pdf , ps , other ]
-
Title: Utilizing Deep Learning to Optimize Software Development ProcessesSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: This study explores the application of deep learning technologies in software development processes, particularly in automating code reviews, error prediction, and test generation to enhance code quality and development efficiency. Through a series of empirical studies, experimental groups using deep learning tools and control groups using traditional methods were compared in terms of code error rates and project completion times. The results demonstrated significant improvements in the experimental group, validating the effectiveness of deep learning technologies. The research also discusses potential optimization points, methodologies, and technical challenges of deep learning in software development, as well as how to integrate these technologies into existing software development workflows.
- [1774] arXiv:2404.13634 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Bt-GAN: Generating Fair Synthetic Healthdata via Bias-transforming Generative Adversarial NetworksJournal-ref: Journal of Artificial Intelligence Research, vol. 79, Apr. 2024, 1313-41Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Synthetic data generation offers a promising solution to enhance the usefulness of Electronic Healthcare Records (EHR) by generating realistic de-identified data. However, the existing literature primarily focuses on the quality of synthetic health data, neglecting the crucial aspect of fairness in downstream predictions. Consequently, models trained on synthetic EHR have faced criticism for producing biased outcomes in target tasks. These biases can arise from either spurious correlations between features or the failure of models to accurately represent sub-groups. To address these concerns, we present Bias-transforming Generative Adversarial Networks (Bt-GAN), a GAN-based synthetic data generator specifically designed for the healthcare domain. In order to tackle spurious correlations (i), we propose an information-constrained Data Generation Process that enables the generator to learn a fair deterministic transformation based on a well-defined notion of algorithmic fairness. To overcome the challenge of capturing exact sub-group representations (ii), we incentivize the generator to preserve sub-group densities through score-based weighted sampling. This approach compels the generator to learn from underrepresented regions of the data manifold. We conduct extensive experiments using the MIMIC-III database. Our results demonstrate that Bt-GAN achieves SOTA accuracy while significantly improving fairness and minimizing bias amplification. We also perform an in-depth explainability analysis to provide additional evidence supporting the validity of our study. In conclusion, our research introduces a novel and professional approach to addressing the limitations of synthetic data generation in the healthcare domain. By incorporating fairness considerations and leveraging advanced techniques such as GANs, we pave the way for more reliable and unbiased predictions in healthcare applications.
- [1775] arXiv:2404.13652 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: BANSAI: Towards Bridging the AI Adoption Gap in Industrial Robotics with Neurosymbolic ProgrammingComments: 6 pages, 3 figures, accepted at the 2024 CIRP International Conference on Manufacturing Systems (CMS)Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Abstract: Over the past decade, deep learning helped solve manipulation problems across all domains of robotics. At the same time, industrial robots continue to be programmed overwhelmingly using traditional program representations and interfaces. This paper undertakes an analysis of this "AI adoption gap" from an industry practitioner's perspective. In response, we propose the BANSAI approach (Bridging the AI Adoption Gap via Neurosymbolic AI). It systematically leverages principles of neurosymbolic AI to establish data-driven, subsymbolic program synthesis and optimization in modern industrial robot programming workflow. BANSAI conceptually unites several lines of prior research and proposes a path toward practical, real-world validation.
- [1776] arXiv:2404.13655 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: SPGNN: Recognizing Salient Subgraph Patterns via Enhanced Graph Convolution and PoolingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graph neural networks (GNNs) have revolutionized the field of machine learning on non-Euclidean data such as graphs and networks. GNNs effectively implement node representation learning through neighborhood aggregation and achieve impressive results in many graph-related tasks. However, most neighborhood aggregation approaches are summation-based, which can be problematic as they may not be sufficiently expressive to encode informative graph structures. Furthermore, though the graph pooling module is also of vital importance for graph learning, especially for the task of graph classification, research on graph down-sampling mechanisms is rather limited.
To address the above challenges, we propose a concatenation-based graph convolution mechanism that injectively updates node representations to maximize the discriminative power in distinguishing non-isomorphic subgraphs. In addition, we design a novel graph pooling module, called WL-SortPool, to learn important subgraph patterns in a deep-learning manner. WL-SortPool layer-wise sorts node representations (i.e. continuous WL colors) to separately learn the relative importance of subtrees with different depths for the purpose of classification, thus better characterizing the complex graph topology and rich information encoded in the graph. We propose a novel Subgraph Pattern GNN (SPGNN) architecture that incorporates these enhancements. We test the proposed SPGNN architecture on many graph classification benchmarks. Experimental results show that our method can achieve highly competitive results with state-of-the-art graph kernels and other GNN approaches. - [1777] arXiv:2404.13657 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MLP: Motion Label Prior for Temporal Sentence Localization in Untrimmed 3D Human MotionsComments: 13 pages, 9 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we address the unexplored question of temporal sentence localization in human motions (TSLM), aiming to locate a target moment from a 3D human motion that semantically corresponds to a text query. Considering that 3D human motions are captured using specialized motion capture devices, motions with only a few joints lack complex scene information like objects and lighting. Due to this character, motion data has low contextual richness and semantic ambiguity between frames, which limits the accuracy of predictions made by current video localization frameworks extended to TSLM to only a rough level. To refine this, we devise two novel label-prior-assisted training schemes: one embed prior knowledge of foreground and background to highlight the localization chances of target moments, and the other forces the originally rough predictions to overlap with the more accurate predictions obtained from the flipped start/end prior label sequences during recovery training. We show that injecting label-prior knowledge into the model is crucial for improving performance at high IoU. In our constructed TSLM benchmark, our model termed MLP achieves a recall of 44.13 at IoU@0.7 on the BABEL dataset and 71.17 on HumanML3D (Restore), outperforming prior works. Finally, we showcase the potential of our approach in corpus-level moment retrieval. Our source code is openly accessible at this https URL .
- [1778] arXiv:2404.13663 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Cumulative Hazard Function Based Efficient Multivariate Temporal Point Process LearningComments: 8 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Most existing temporal point process models are characterized by conditional intensity function. These models often require numerical approximation methods for likelihood evaluation, which potentially hurts their performance. By directly modelling the integral of the intensity function, i.e., the cumulative hazard function (CHF), the likelihood can be evaluated accurately, making it a promising approach. However, existing CHF-based methods are not well-defined, i.e., the mathematical constraints of CHF are not completely satisfied, leading to untrustworthy results. For multivariate temporal point process, most existing methods model intensity (or density, etc.) functions for each variate, limiting the scalability. In this paper, we explore using neural networks to model a flexible but well-defined CHF and learning the multivariate temporal point process with low parameter complexity. Experimental results on six datasets show that the proposed model achieves the state-of-the-art performance on data fitting and event prediction tasks while having significantly fewer parameters and memory usage than the strong competitors. The source code and data can be obtained from this https URL .
- [1779] arXiv:2404.13667 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MathNet: A Data-Centric Approach for Printed Mathematical Expression RecognitionFelix M. Schmitt-Koopmann , Elaine M. Huang , Hans-Peter Hutter , Thilo Stadelmann , Alireza DarvishyComments: 12 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Printed mathematical expression recognition (MER) models are usually trained and tested using LaTeX-generated mathematical expressions (MEs) as input and the LaTeX source code as ground truth. As the same ME can be generated by various different LaTeX source codes, this leads to unwanted variations in the ground truth data that bias test performance results and hinder efficient learning. In addition, the use of only one font to generate the MEs heavily limits the generalization of the reported results to realistic scenarios. We propose a data-centric approach to overcome this problem, and present convincing experimental results: Our main contribution is an enhanced LaTeX normalization to map any LaTeX ME to a canonical form. Based on this process, we developed an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one. Second, we introduce the real-world dataset realFormula, with MEs extracted from papers. Third, we developed a MER model, MathNet, based on a convolutional vision transformer, with superior results on all four test sets (im2latex-100k, im2latexv2, realFormula, and InftyMDB-1), outperforming the previous state of the art by up to 88.3%.
- [1780] arXiv:2404.13680 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: PoseAnimate: Zero-shot high fidelity pose controllable character animationBingwen Zhu , Fanyi Wang , Tianyi Lu , Peng Liu , Jingwen Su , Jinxiu Liu , Yanhao Zhang , Zuxuan Wu , Yu-Gang Jiang , Guo-Jun QiSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Image-to-video(I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity with the source image.However, existing approaches suffer from character appearance inconsistency and poor preservation of fine details. Moreover, they require a large amount of video data for training, which can be computationally this http URL address these limitations,we propose PoseAnimate, a novel zero-shot I2V framework for character animation.PoseAnimate contains three key components: 1) Pose-Aware Control Module (PACM) incorporates diverse pose signals into conditional embeddings, to preserve character-independent content and maintain precise alignment of actions.2) Dual Consistency Attention Module (DCAM) enhances temporal consistency, and retains character identity and intricate background details.3) Mask-Guided Decoupling Module (MGDM) refines distinct feature perception, improving animation fidelity by decoupling the character and background.We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transition.Extensive experiment results demonstrate that our approach outperforms the state-of-the-art training-based methods in terms of character consistency and detail fidelity. Moreover, it maintains a high level of temporal coherence throughout the generated animations.
- [1781] arXiv:2404.13706 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Concept Arithmetics for Circumventing Concept Inhibition in Diffusion ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Motivated by ethical and legal concerns, the scientific community is actively developing methods to limit the misuse of Text-to-Image diffusion models for reproducing copyrighted, violent, explicit, or personal information in the generated images. Simultaneously, researchers put these newly developed safety measures to the test by assuming the role of an adversary to find vulnerabilities and backdoors in them. We use compositional property of diffusion models, which allows to leverage multiple prompts in a single image generation. This property allows us to combine other concepts, that should not have been affected by the inhibition, to reconstruct the vector, responsible for target concept generation, even though the direct computation of this vector is no longer accessible. We provide theoretical and empirical evidence why the proposed attacks are possible and discuss the implications of these findings for safe model deployment. We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary. Our work opens up the discussion about the implications of concept arithmetics and compositional inference for safety mechanisms in diffusion models.
Content Advisory: This paper contains discussions and model-generated content that may be considered offensive. Reader discretion is advised.
Project page: this https URL - [1782] arXiv:2404.13719 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: A Practical Multilevel Governance Framework for Autonomous and Intelligent SystemsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Abstract: Autonomous and intelligent systems (AIS) facilitate a wide range of beneficial applications across a variety of different domains. However, technical characteristics such as unpredictability and lack of transparency, as well as potential unintended consequences, pose considerable challenges to the current governance infrastructure. Furthermore, the speed of development and deployment of applications outpaces the ability of existing governance institutions to put in place effective ethical-legal oversight. New approaches for agile, distributed and multilevel governance are needed. This work presents a practical framework for multilevel governance of AIS. The framework enables mapping actors onto six levels of decision-making including the international, national and organizational levels. Furthermore, it offers the ability to identify and evolve existing tools or create new tools for guiding the behavior of actors within the levels. Governance mechanisms enable actors to shape and enforce regulations and other tools, which when complemented with good practices contribute to effective and comprehensive governance.
- [1783] arXiv:2404.13733 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Elucidating the Design Space of Dataset CondensationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Dataset condensation, a concept within data-centric learning, efficiently transfers critical attributes from an original dataset to a synthetic version, maintaining both diversity and realism. This approach significantly improves model training efficiency and is adaptable across multiple application areas. Previous methods in dataset condensation have faced challenges: some incur high computational costs which limit scalability to larger datasets (e.g., MTT, DREAM, and TESLA), while others are restricted to less optimal design spaces, which could hinder potential improvements, especially in smaller datasets (e.g., SRe2L, G-VBSM, and RDED). To address these limitations, we propose a comprehensive design framework that includes specific, effective strategies like implementing soft category-aware matching and adjusting the learning rate schedule. These strategies are grounded in empirical evidence and theoretical backing. Our resulting approach, Elucidate Dataset Condensation (EDC), establishes a benchmark for both small and large-scale dataset condensation. In our testing, EDC achieves state-of-the-art accuracy, reaching 48.6% on ImageNet-1k with a ResNet-18 model at an IPC of 10, which corresponds to a compression ratio of 0.78%. This performance exceeds those of SRe2L, G-VBSM, and RDED by margins of 27.3%, 17.2%, and 6.6%, respectively.
- [1784] arXiv:2404.13736 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Interval Abstractions for Robust Counterfactual ExplanationsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Counterfactual Explanations (CEs) have emerged as a major paradigm in explainable AI research, providing recourse recommendations for users affected by the decisions of machine learning models. However, when slight changes occur in the parameters of the underlying model, CEs found by existing methods often become invalid for the updated models. The literature lacks a way to certify deterministic robustness guarantees for CEs under model changes, in that existing methods to improve CEs' robustness are heuristic, and the robustness performances are evaluated empirically using only a limited number of retrained models. To bridge this gap, we propose a novel interval abstraction technique for parametric machine learning models, which allows us to obtain provable robustness guarantees of CEs under the possibly infinite set of plausible model changes $\Delta$. We formalise our robustness notion as the $\Delta$-robustness for CEs, in both binary and multi-class classification settings. We formulate procedures to verify $\Delta$-robustness based on Mixed Integer Linear Programming, using which we further propose two algorithms to generate CEs that are $\Delta$-robust. In an extensive empirical study, we demonstrate how our approach can be used in practice by discussing two strategies for determining the appropriate hyperparameter in our method, and we quantitatively benchmark the CEs generated by eleven methods, highlighting the effectiveness of our algorithms in finding robust CEs.
- [1785] arXiv:2404.13737 (cross-list from cs.DS) [ pdf , ps , html , other ]
-
Title: Stochastic Multi-round Submodular Optimization with BudgetSubjects: Data Structures and Algorithms (cs.DS) ; Artificial Intelligence (cs.AI)
Abstract: In this work we study the problem of Stochastic Budgeted Multi-round Submodular Maximization (SBMSm), in which we would like to maximize the sum over multiple rounds of the value of a monotone and submodular objective function, subject to the fact that the values of this function depend on the realization of stochastic events and the number of observations that we can make over all rounds is limited by a given budget. This problem extends, and generalizes to multiple round settings, well-studied problems such as (adaptive) influence maximization and stochastic probing.
We first show that whenever a certain single-round optimization problem can be optimally solved in polynomial time, then there is a polynomial time dynamic programming algorithm that returns the same solution as the optimal algorithm, that can adaptively choose both which observations to make and in which round to have them. Unfortunately, this dynamic programming approach cannot be extended to work when the single-round optimization problem cannot be efficiently solved (even if we allow it would be approximated within an arbitrary small constant). Anyway, in this case we are able to provide a simple greedy algorithm for the problem. It guarantees a $(1/2-\epsilon)$-approximation to the optimal value, even if it non-adaptively allocates the budget to rounds. - [1786] arXiv:2404.13742 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Seamless Underwater Navigation with Limited Doppler Velocity Log MeasurementsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: Autonomous Underwater Vehicles (AUVs) commonly utilize an inertial navigation system (INS) and a Doppler velocity log (DVL) for underwater navigation. To that end, their measurements are integrated through a nonlinear filter such as the extended Kalman filter (EKF). The DVL velocity vector estimate depends on retrieving reflections from the seabed, ensuring that at least three out of its four transmitted acoustic beams return successfully. When fewer than three beams are obtained, the DVL cannot provide a velocity update to bind the navigation solution drift. To cope with this challenge, in this paper, we propose a hybrid neural coupled (HNC) approach for seamless AUV navigation in situations of limited DVL measurements. First, we drive an approach to regress two or three missing DVL beams. Then, those beams, together with the measured beams, are incorporated into the EKF. We examined INS/DVL fusion both in loosely and tightly coupled approaches. Our method was trained and evaluated on recorded data from AUV experiments conducted in the Mediterranean Sea on two different occasions. The results illustrate that our proposed method outperforms the baseline loosely and tightly coupled model-based approaches by an average of 96.15%. It also demonstrates superior performance compared to a model-based beam estimator by an average of 12.41% in terms of velocity accuracy for scenarios involving two or three missing beams. Therefore, we demonstrate that our approach offers seamless AUV navigation in situations of limited beam measurements.
- [1787] arXiv:2404.13745 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A Nasal Cytology Dataset for Object Detection and Deep LearningMauro Camporeale , Giovanni Dimauro , Matteo Gelardi , Giorgia Iacobellis , Mattia Sebastiano Ladisa , Sergio Latrofa , Nunzia LomonteComments: Pre Print almost ready to be submittedSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Nasal Cytology is a new and efficient clinical technique to diagnose rhinitis and allergies that is not much widespread due to the time-consuming nature of cell counting; that is why AI-aided counting could be a turning point for the diffusion of this technique. In this article we present the first dataset of rhino-cytological field images: the NCD (Nasal Cytology Dataset), aimed to train and deploy Object Detection models to support physicians and biologists during clinical practice. The real distribution of the cytotypes, populating the nasal mucosa has been replicated, sampling images from slides of clinical patients, and manually annotating each cell found on them. The correspondent object detection task presents non'trivial issues associated with the strong class imbalancement, involving the rarest cell types. This work contributes to some of open challenges by presenting a novel machine learning-based approach to aid the automated detection and classification of nasal mucosa cells: the DETR and YOLO models shown good performance in detecting cells and classifying them correctly, revealing great potential to accelerate the work of rhinology experts.
- [1788] arXiv:2404.13752 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards General Conceptual Model Editing via Adversarial Representation EngineeringSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Optimization and Control (math.OC)
Abstract: Recent research has introduced Representation Engineering (RepE) as a promising approach for understanding complex inner workings of large-scale models like Large Language Models (LLMs). However, finding practical and efficient methods to apply these representations for general and flexible model editing remains an open problem. Inspired by the Generative Adversarial Network (GAN) framework, we introduce a novel approach called Adversarial Representation Engineering (ARE). This method leverages RepE by using a representation sensor to guide the editing of LLMs, offering a unified and interpretable framework for conceptual model editing without degrading baseline performance. Our experiments on multiple conceptual editing confirm ARE's effectiveness. Code and data are available at this https URL .
- [1789] arXiv:2404.13767 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Autonomous Robot for Disaster Mapping and Victim LocalizationComments: Class final project for Northeastern University EECE 5550 Mobile Robotics CourseSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: In response to the critical need for effective reconnaissance in disaster scenarios, this research article presents the design and implementation of a complete autonomous robot system using the Turtlebot3 with Robotic Operating System (ROS) Noetic. Upon deployment in closed, initially unknown environments, the system aims to generate a comprehensive map and identify any present 'victims' using AprilTags as stand-ins. We discuss our solution for search and rescue missions, while additionally exploring more advanced algorithms to improve search and rescue functionalities. We introduce a Cubature Kalman Filter to help reduce the mean squared error [m] for AprilTag localization and an information-theoretic exploration algorithm to expedite exploration in unknown environments. Just like turtles, our system takes it slow and steady, but when it's time to save the day, it moves at ninja-like speed! Despite Donatello's shell, he's no slowpoke - he zips through obstacles with the agility of a teenage mutant ninja turtle. So, hang on tight to your shells and get ready for a whirlwind of reconnaissance!
Full pipeline code this https URL
Exploration code this https URL - [1790] arXiv:2404.13770 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: EncodeNet: A Framework for Boosting DNN Accuracy with Entropy-driven Generalized Converting AutoencoderComments: 15 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Image classification is a fundamental task in computer vision, and the quest to enhance DNN accuracy without inflating model size or latency remains a pressing concern. We make a couple of advances in this regard, leading to a novel EncodeNet design and training framework. The first advancement involves Converting Autoencoders, a novel approach that transforms images into an easy-to-classify image of its class. Our prior work that applied the Converting Autoencoder and a simple classifier in tandem achieved moderate accuracy over simple datasets, such as MNIST and FMNIST. However, on more complex datasets like CIFAR-10, the Converting Autoencoder has a large reconstruction loss, making it unsuitable for enhancing DNN accuracy. To address these limitations, we generalize the design of Converting Autoencoders by leveraging a larger class of DNNs, those with architectures comprising feature extraction layers followed by classification layers. We incorporate a generalized algorithmic design of the Converting Autoencoder and intraclass clustering to identify representative images, leading to optimized image feature learning. Next, we demonstrate the effectiveness of our EncodeNet design and training framework, improving the accuracy of well-trained baseline DNNs while maintaining the overall model size. EncodeNet's building blocks comprise the trained encoder from our generalized Converting Autoencoders transferring knowledge to a lightweight classifier network - also extracted from the baseline DNN. Our experimental results demonstrate that EncodeNet improves the accuracy of VGG16 from 92.64% to 94.05% on CIFAR-10 and RestNet20 from 74.56% to 76.04% on CIFAR-100. It outperforms state-of-the-art techniques that rely on knowledge distillation and attention mechanisms, delivering higher accuracy for models of comparable size.
- [1791] arXiv:2404.13786 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Soar: Design and Deployment of A Smart Roadside Infrastructure System for Autonomous DrivingShuyao Shi , Neiwen Ling , Zhehao Jiang , Xuan Huang , Yuze He , Xiaoguang Zhao , Bufang Yang , Chen Bian , Jingfei Xia , Zhenyu Yan , Raymond Yeung , Guoliang XingSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Abstract: Recently,smart roadside infrastructure (SRI) has demonstrated the potential of achieving fully autonomous driving systems. To explore the potential of infrastructure-assisted autonomous driving, this paper presents the design and deployment of Soar, the first end-to-end SRI system specifically designed to support autonomous driving systems. Soar consists of both software and hardware components carefully designed to overcome various system and physical challenges. Soar can leverage the existing operational infrastructure like street lampposts for a lower barrier of adoption. Soar adopts a new communication architecture that comprises a bi-directional multi-hop I2I network and a downlink I2V broadcast service, which are designed based on off-the-shelf 802.11ac interfaces in an integrated manner. Soar also features a hierarchical DL task management framework to achieve desirable load balancing among nodes and enable them to collaborate efficiently to run multiple data-intensive autonomous driving applications. We deployed a total of 18 Soar nodes on existing lampposts on campus, which have been operational for over two years. Our real-world evaluation shows that Soar can support a diverse set of autonomous driving applications and achieve desirable real-time performance and high communication reliability. Our findings and experiences in this work offer key insights into the development and deployment of next-generation smart roadside infrastructure and autonomous driving systems.
- [1792] arXiv:2404.13788 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AnyPattern: Towards In-context Image Copy DetectionComments: The project is publicly available at this https URL . arXiv admin note: text overlap with arXiv:2403.06098Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This paper explores in-context learning for image copy detection (ICD), i.e., prompting an ICD model to identify replicated images with new tampering patterns without the need for additional training. The prompts (or the contexts) are from a small set of image-replica pairs that reflect the new patterns and are used at inference time. Such in-context ICD has good realistic value, because it requires no fine-tuning and thus facilitates fast reaction against the emergence of unseen patterns. To accommodate the "seen $\rightarrow$ unseen" generalization scenario, we construct the first large-scale pattern dataset named AnyPattern, which has the largest number of tamper patterns ($90$ for training and $10$ for testing) among all the existing ones. We benchmark AnyPattern with popular ICD methods and reveal that existing methods barely generalize to novel patterns. We further propose a simple in-context ICD method named ImageStacker. ImageStacker learns to select the most representative image-replica pairs and employs them as the pattern prompts in a stacking manner (rather than the popular concatenation manner). Experimental results show (1) training with our large-scale dataset substantially benefits pattern generalization ($+26.66 \%$ $\mu AP$), (2) the proposed ImageStacker facilitates effective in-context ICD (another round of $+16.75 \%$ $\mu AP$), and (3) AnyPattern enables in-context ICD, i.e., without such a large-scale dataset, in-context learning does not emerge even with our ImageStacker. Beyond the ICD task, we also demonstrate how AnyPattern can benefit artists, i.e., the pattern retrieval method trained on AnyPattern can be generalized to identify style mimicry by text-to-image models. The project is publicly available at this https URL .
- [1793] arXiv:2404.13789 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Anchor-aware Deep Metric Learning for Audio-visual RetrievalComments: 9 pages, 5 figures. Accepted by ACM ICMR 2024Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Abstract: Metric learning minimizes the gap between similar (positive) pairs of data points and increases the separation of dissimilar (negative) pairs, aiming at capturing the underlying data structure and enhancing the performance of tasks like audio-visual cross-modal retrieval (AV-CMR). Recent works employ sampling methods to select impactful data points from the embedding space during training. However, the model training fails to fully explore the space due to the scarcity of training data points, resulting in an incomplete representation of the overall positive and negative distributions. In this paper, we propose an innovative Anchor-aware Deep Metric Learning (AADML) method to address this challenge by uncovering the underlying correlations among existing data points, which enhances the quality of the shared embedding space. Specifically, our method establishes a correlation graph-based manifold structure by considering the dependencies between each sample as the anchor and its semantically similar samples. Through dynamic weighting of the correlations within this underlying manifold structure using an attention-driven mechanism, Anchor Awareness (AA) scores are obtained for each anchor. These AA scores serve as data proxies to compute relative distances in metric learning approaches. Extensive experiments conducted on two audio-visual benchmark datasets demonstrate the effectiveness of our proposed AADML method, significantly surpassing state-of-the-art models. Furthermore, we investigate the integration of AA proxies with various metric learning methods, further highlighting the efficacy of our approach.
- [1794] arXiv:2404.13791 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Universal Fingerprint Generation: Controllable Diffusion Model with Multimodal ConditionsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The utilization of synthetic data for fingerprint recognition has garnered increased attention due to its potential to alleviate privacy concerns surrounding sensitive biometric data. However, current methods for generating fingerprints have limitations in creating impressions of the same finger with useful intra-class variations. To tackle this challenge, we present GenPrint, a framework to produce fingerprint images of various types while maintaining identity and offering humanly understandable control over different appearance factors such as fingerprint class, acquisition type, sensor device, and quality level. Unlike previous fingerprint generation approaches, GenPrint is not confined to replicating style characteristics from the training dataset alone: it enables the generation of novel styles from unseen devices without requiring additional fine-tuning. To accomplish these objectives, we developed GenPrint using latent diffusion models with multimodal conditions (text and image) for consistent generation of style and identity. Our experiments leverage a variety of publicly available datasets for training and evaluation. Results demonstrate the benefits of GenPrint in terms of identity preservation, explainable control, and universality of generated images. Importantly, the GenPrint-generated images yield comparable or even superior accuracy to models trained solely on real data and further enhances performance when augmenting the diversity of existing real fingerprint datasets.
- [1795] arXiv:2404.13792 (cross-list from cs.MM) [ pdf , ps , html , other ]
-
Title: Counterfactual Reasoning Using Predicted Latent Personality Dimensions for Optimizing Persuasion OutcomeDonghuo Zeng , Roberto S. Legaspi , Yuewen Sun , Xinshuai Dong , Kazushi Ikeda , Peter Spirtes , kun ZhangComments: 14 pages, 10 figures, Accepted by Persuasive Technology 2024Subjects: Multimedia (cs.MM) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Abstract: Customizing persuasive conversations related to the outcome of interest for specific users achieves better persuasion results. However, existing persuasive conversation systems rely on persuasive strategies and encounter challenges in dynamically adjusting dialogues to suit the evolving states of individual users during interactions. This limitation restricts the system's ability to deliver flexible or dynamic conversations and achieve suboptimal persuasion outcomes. In this paper, we present a novel approach that tracks a user's latent personality dimensions (LPDs) during ongoing persuasion conversation and generates tailored counterfactual utterances based on these LPDs to optimize the overall persuasion outcome. In particular, our proposed method leverages a Bi-directional Generative Adversarial Network (BiCoGAN) in tandem with a Dialogue-based Personality Prediction Regression (DPPR) model to generate counterfactual data. This enables the system to formulate alternative persuasive utterances that are more suited to the user. Subsequently, we utilize the D3QN model to learn policies for optimized selection of system utterances on counterfactual data. Experimental results we obtained from using the PersuasionForGood dataset demonstrate the superiority of our approach over the existing method, BiCoGAN. The cumulative rewards and Q-values produced by our method surpass ground truth benchmarks, showcasing the efficacy of employing counterfactual reasoning and LPDs to optimize reinforcement learning policy in online interactions.
- [1796] arXiv:2404.13812 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: A Comparative Study on Enhancing Prediction in Social Network Advertisement through Data AugmentationComments: Accepted by 2024 4th International Conference on Machine Learning and Intelligent Systems Engineering (MLISE)Subjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI)
Abstract: In the ever-evolving landscape of social network advertising, the volume and accuracy of data play a critical role in the performance of predictive models. However, the development of robust predictive algorithms is often hampered by the limited size and potential bias present in real-world datasets. This study presents and explores a generative augmentation framework of social network advertising data. Our framework explores three generative models for data augmentation - Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Gaussian Mixture Models (GMMs) - to enrich data availability and diversity in the context of social network advertising analytics effectiveness. By performing synthetic extensions of the feature space, we find that through data augmentation, the performance of various classifiers has been quantitatively improved. Furthermore, we compare the relative performance gains brought by each data augmentation technique, providing insights for practitioners to select appropriate techniques to enhance model performance. This paper contributes to the literature by showing that synthetic data augmentation alleviates the limitations imposed by small or imbalanced datasets in the field of social network advertising. At the same time, this article also provides a comparative perspective on the practicality of different data augmentation methods, thereby guiding practitioners to choose appropriate techniques to enhance model performance.
- [1797] arXiv:2404.13813 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: From LLM to NMT: Advancing Low-Resource Machine Translation with ClaudeComments: 17 pages, 15 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We show that Claude 3 Opus, a large language model (LLM) released by Anthropic in March 2024, exhibits stronger machine translation competence than other LLMs. Though we find evidence of data contamination with Claude on FLORES-200, we curate new benchmarks that corroborate the effectiveness of Claude for low-resource machine translation into English. We find that Claude has remarkable \textit{resource efficiency} -- the degree to which the quality of the translation model depends on a language pair's resource level. Finally, we show that advancements in LLM translation can be compressed into traditional neural machine translation (NMT) models. Using Claude to generate synthetic data, we demonstrate that knowledge distillation advances the state-of-the-art in Yoruba-English translation, meeting or surpassing strong baselines like NLLB-54B and Google Translate.
- [1798] arXiv:2404.13841 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Fair Concurrent Training of Multiple Models in Federated LearningMarie Siew , Haoran Zhang , Jong-Ik Park , Yuezhou Liu , Yichen Ruan , Lili Su , Stratis Ioannidis , Edmund Yeh , Carlee Joe-WongSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Federated learning (FL) enables collaborative learning across multiple clients. In most FL work, all clients train a single learning task. However, the recent proliferation of FL applications may increasingly require multiple FL tasks to be trained simultaneously, sharing clients' computing and communication resources, which we call Multiple-Model Federated Learning (MMFL). Current MMFL algorithms use naive average-based client-task allocation schemes that can lead to unfair performance when FL tasks have heterogeneous difficulty levels, e.g., tasks with larger models may need more rounds and data to train. Just as naively allocating resources to generic computing jobs with heterogeneous resource needs can lead to unfair outcomes, naive allocation of clients to FL tasks can lead to unfairness, with some tasks having excessively long training times, or lower converged accuracies. Furthermore, in the FL setting, since clients are typically not paid for their training effort, we face a further challenge that some clients may not even be willing to train some tasks, e.g., due to high computational costs, which may exacerbate unfairness in training outcomes across tasks. We address both challenges by firstly designing FedFairMMFL, a difficulty-aware algorithm that dynamically allocates clients to tasks in each training round. We provide guarantees on airness and FedFairMMFL's convergence rate. We then propose a novel auction design that incentivizes clients to train multiple tasks, so as to fairly distribute clients' training efforts across the tasks. We show how our fairness-based learning and incentive mechanisms impact training convergence and finally evaluate our algorithm with multiple sets of learning tasks on real world datasets.
- [1799] arXiv:2404.13844 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: ColA: Collaborative Adaptation with Gradient LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: A primary function of back-propagation is to compute both the gradient of hidden representations and parameters for optimization with gradient descent. Training large models requires high computational costs due to their vast parameter sizes. While Parameter-Efficient Fine-Tuning (PEFT) methods aim to train smaller auxiliary models to save computational space, they still present computational overheads, especially in Fine-Tuning as a Service (FTaaS) for numerous users. We introduce Collaborative Adaptation (ColA) with Gradient Learning (GL), a parameter-free, model-agnostic fine-tuning approach that decouples the computation of the gradient of hidden representations and parameters. In comparison to PEFT methods, ColA facilitates more cost-effective FTaaS by offloading the computation of the gradient to low-cost devices. We also provide a theoretical analysis of ColA and experimentally demonstrate that ColA can perform on par or better than existing PEFT methods on various benchmarks.
- [1800] arXiv:2404.13846 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Filtered Direct Preference OptimizationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the RLHF framework, to our knowledge, have been limited. This paper addresses the issue of text quality within the preference dataset by focusing on Direct Preference Optimization (DPO), an increasingly adopted reward-model-free RLHF method. We confirm that text quality significantly influences the performance of models optimized with DPO more than those optimized with reward-model-based RLHF. Building on this new insight, we propose an extension of DPO, termed filtered direct preference optimization (fDPO). fDPO uses a trained reward model to monitor the quality of texts within the preference dataset during DPO training. Samples of lower quality are discarded based on comparisons with texts generated by the model being optimized, resulting in a more accurate dataset. Experimental results demonstrate that fDPO enhances the final model performance. Our code is available at this https URL .
- [1801] arXiv:2404.13859 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Unveiling and Mitigating Generalized Biases of DNNs through the Intrinsic Dimensions of Perceptual ManifoldsYanbiao Ma , Licheng Jiao , Fang Liu , Lingling Li , Wenping Ma , Shuyuan Yang , Xu Liu , Puhua ChenComments: 8pages, 6figures, Submitted to TPAMISubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Building fair deep neural networks (DNNs) is a crucial step towards achieving trustworthy artificial intelligence. Delving into deeper factors that affect the fairness of DNNs is paramount and serves as the foundation for mitigating model biases. However, current methods are limited in accurately predicting DNN biases, relying solely on the number of training samples and lacking more precise measurement tools. Here, we establish a geometric perspective for analyzing the fairness of DNNs, comprehensively exploring how DNNs internally shape the intrinsic geometric characteristics of datasets-the intrinsic dimensions (IDs) of perceptual manifolds, and the impact of IDs on the fairness of DNNs. Based on multiple findings, we propose Intrinsic Dimension Regularization (IDR), which enhances the fairness and performance of models by promoting the learning of concise and ID-balanced class perceptual manifolds. In various image recognition benchmark tests, IDR significantly mitigates model bias while improving its performance.
- [1802] arXiv:2404.13885 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Surveying Attitudinal Alignment Between Large Language Models Vs. Humans Towards 17 Sustainable Development GoalsQingyang Wu , Ying Xu , Tingsong Xiao , Yunze Xiao , Yitong Li , Tianyang Wang , Yichi Zhang , Shanghai Zhong , Yuwei Zhang , Wei Lu , Yifan YangSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs) have emerged as potent tools for advancing the United Nations' Sustainable Development Goals (SDGs). However, the attitudinal disparities between LLMs and humans towards these goals can pose significant challenges. This study conducts a comprehensive review and analysis of the existing literature on the attitudes of LLMs towards the 17 SDGs, emphasizing the comparison between their attitudes and support for each goal and those of humans. We examine the potential disparities, primarily focusing on aspects such as understanding and emotions, cultural and regional differences, task objective variations, and factors considered in the decision-making process. These disparities arise from the underrepresentation and imbalance in LLM training data, historical biases, quality issues, lack of contextual understanding, and skewed ethical values reflected. The study also investigates the risks and harms that may arise from neglecting the attitudes of LLMs towards the SDGs, including the exacerbation of social inequalities, racial discrimination, environmental destruction, and resource wastage. To address these challenges, we propose strategies and recommendations to guide and regulate the application of LLMs, ensuring their alignment with the principles and goals of the SDGs, and therefore creating a more just, inclusive, and sustainable future.
- [1803] arXiv:2404.13891 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Minimizing Weighted Counterfactual Regret with Optimistic Online Mirror DescentComments: Accepted to 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Abstract: Counterfactual regret minimization (CFR) is a family of algorithms for effectively solving imperfect-information games. It decomposes the total regret into counterfactual regrets, utilizing local regret minimization algorithms, such as Regret Matching (RM) or RM+, to minimize them. Recent research establishes a connection between Online Mirror Descent (OMD) and RM+, paving the way for an optimistic variant PRM+ and its extension PCFR+. However, PCFR+ assigns uniform weights for each iteration when determining regrets, leading to substantial regrets when facing dominated actions. This work explores minimizing weighted counterfactual regret with optimistic OMD, resulting in a novel CFR variant PDCFR+. It integrates PCFR+ and Discounted CFR (DCFR) in a principled manner, swiftly mitigating negative effects of dominated actions and consistently leveraging predictions to accelerate convergence. Theoretical analyses prove that PDCFR+ converges to a Nash equilibrium, particularly under distinct weighting schemes for regrets and average strategies. Experimental results demonstrate PDCFR+'s fast convergence in common imperfect-information games. The code is available at this https URL .
- [1804] arXiv:2404.13892 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Retrieval-Augmented Audio Deepfake DetectionComments: Accepted by the 2024 International Conference on Multimedia Retrieval (ICMR 2024)Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired by retrieval-augmented generation (RAG), we propose a retrieval-augmented detection (RAD) framework that augments test samples with similar retrieved samples for enhanced detection. We also extend the multi-fusion attentive classifier to integrate it with our proposed RAD framework. Extensive experiments show the superior performance of the proposed RAD framework over baseline methods, achieving state-of-the-art results on the ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA sets. Further sample analysis indicates that the retriever consistently retrieves samples mostly from the same speaker with acoustic characteristics highly consistent with the query audio, thereby improving detection performance.
- [1805] arXiv:2404.13899 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Better Text-to-Image Generation Alignment via Attention ModulationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: In text-to-image generation tasks, the advancements of diffusion models have facilitated the fidelity of generated results. However, these models encounter challenges when processing text prompts containing multiple entities and attributes. The uneven distribution of attention results in the issues of entity leakage and attribute misalignment. Training from scratch to address this issue requires numerous labeled data and is resource-consuming. Motivated by this, we propose an attribution-focusing mechanism, a training-free phase-wise mechanism by modulation of attention for diffusion model. One of our core ideas is to guide the model to concentrate on the corresponding syntactic components of the prompt at distinct timesteps. To achieve this, we incorporate a temperature control mechanism within the early phases of the self-attention modules to mitigate entity leakage issues. An object-focused masking scheme and a phase-wise dynamic weight control mechanism are integrated into the cross-attention modules, enabling the model to discern the affiliation of semantic information between entities more effectively. The experimental results in various alignment scenarios demonstrate that our model attain better image-text alignment with minimal additional computational cost.
- [1806] arXiv:2404.13906 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Generating Attractive and Authentic Copywriting from Customer ReviewsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The goal of product copywriting is to capture the interest of potential buyers by emphasizing the features of products through text descriptions. As e-commerce platforms offer a wide range of services, it's becoming essential to dynamically adjust the styles of these auto-generated descriptions. Typical approaches to copywriting generation often rely solely on specified product attributes, which may result in dull and repetitive content. To tackle this issue, we propose to generate copywriting based on customer reviews, as they provide firsthand practical experiences with products, offering a richer source of information than just product attributes. We have developed a sequence-to-sequence framework, enhanced with reinforcement learning, to produce copywriting that is attractive, authentic, and rich in information. Our framework outperforms all existing baseline and zero-shot large language models, including LLaMA-2-chat-7B and GPT-3.5, in terms of both attractiveness and faithfulness. Furthermore, this work features the use of LLMs for aspect-based summaries collection and argument allure assessment. Experiments demonstrate the effectiveness of using LLMs for marketing domain corpus construction. The code and the dataset is publicly available at: this https URL .
- [1807] arXiv:2404.13910 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Integrated Gradient Correlation: a Dataset-wise Attribution MethodPierre Lelièvre , Chien-Chung Chen (National Taiwan University)Comments: 12 pages, 8 figures, source code at this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Attribution methods are primarily designed to study the distribution of input component contributions to individual model predictions. However, some research applications require a summary of attribution patterns across the entire dataset to facilitate the interpretability of the scrutinized models. In this paper, we present a new method called Integrated Gradient Correlation (IGC) that relates dataset-wise attributions to a model prediction score and enables region-specific analysis by a direct summation over associated components. We demonstrate our method on scalar predictions with the study of image feature representation in the brain from fMRI neural signals and the estimation of neural population receptive fields (NSD dataset), as well as on categorical predictions with the investigation of handwritten digit recognition (MNIST dataset). The resulting IGC attributions show selective patterns, revealing underlying model strategies coherent with their respective objectives.
- [1808] arXiv:2404.13919 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Navigating the Path of Writing: Outline-guided Text Generation with Large Language ModelsComments: under reviewSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Large Language Models (LLMs) have significantly impacted the writing process, enabling collaborative content creation and enhancing productivity. However, generating high-quality, user-aligned text remains challenging. In this paper, we propose Writing Path, a framework that uses explicit outlines to guide LLMs in generating goal-oriented, high-quality pieces of writing. Our approach draws inspiration from structured writing planning and reasoning paths, focusing on capturing and reflecting user intentions throughout the writing process. We construct a diverse dataset from unstructured blog posts to benchmark writing performance and introduce a comprehensive evaluation framework assessing the quality of outlines and generated texts. Our evaluations with GPT-3.5-turbo, GPT-4, and HyperCLOVA X demonstrate that the Writing Path approach significantly enhances text quality according to both LLMs and human evaluations. This study highlights the potential of integrating writing-specific techniques into LLMs to enhance their ability to meet the diverse writing needs of users.
- [1809] arXiv:2404.13941 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Autoencoder-assisted Feature Ensemble Net for Incipient FaultsSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Deep learning has shown the great power in the field of fault detection. However, for incipient faults with tiny amplitude, the detection performance of the current deep learning networks (DLNs) is not satisfactory. Even if prior information about the faults is utilized, DLNs can't successfully detect faults 3, 9 and 15 in Tennessee Eastman process (TEP). These faults are notoriously difficult to detect, lacking effective detection technologies in the field of fault detection. In this work, we propose Autoencoder-assisted Feature Ensemble Net (AE-FENet): a deep feature ensemble framework that uses the unsupervised autoencoder to conduct the feature transformation. Compared with the principle component analysis (PCA) technique adopted in the original Feature Ensemble Net (FENet), autoencoder can mine more exact features on incipient faults, which results in the better detection performance of AE-FENet. With same kinds of basic detectors, AE-FENet achieves a state-of-the-art average accuracy over 96% on faults 3, 9 and 15 in TEP, which represents a significant enhancement in performance compared to other methods. Plenty of experiments have been done to extend our framework, proving that DLNs can be utilized efficiently within this architecture.
- [1810] arXiv:2404.13954 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A survey of air combat behavior modeling using machine learningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: With the recent advances in machine learning, creating agents that behave realistically in simulated air combat has become a growing field of interest. This survey explores the application of machine learning techniques for modeling air combat behavior, motivated by the potential to enhance simulation-based pilot training. Current simulated entities tend to lack realistic behavior, and traditional behavior modeling is labor-intensive and prone to loss of essential domain knowledge between development steps. Advancements in reinforcement learning and imitation learning algorithms have demonstrated that agents may learn complex behavior from data, which could be faster and more scalable than manual methods. Yet, making adaptive agents capable of performing tactical maneuvers and operating weapons and sensors still poses a significant challenge. The survey examines applications, behavior model types, prevalent machine learning methods, and the technical and human challenges in developing adaptive and realistically behaving agents. Another challenge is the transfer of agents from learning environments to military simulation systems and the consequent demand for standardization. Four primary recommendations are presented regarding increased emphasis on beyond-visual-range scenarios, multi-agent machine learning and cooperation, utilization of hierarchical behavior models, and initiatives for standardization and research collaboration. These recommendations aim to address current issues and guide the development of more comprehensive, adaptable, and realistic machine learning-based behavior models for air combat applications.
- [1811] arXiv:2404.13968 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Protecting Your LLMs with Information BottleneckZichuan Liu , Zefan Wang , Linjie Xu , Jinyu Wang , Lei Song , Tianchun Wang , Chunlin Chen , Wei Cheng , Jiang BianSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: The advent of large language models (LLMs) has revolutionized the field of natural language processing, yet they might be attacked to produce harmful content. Despite efforts to ethically align LLMs, these are often fragile and can be circumvented by jailbreaking attacks through optimized or manual adversarial prompts. To address this, we introduce the Information Bottleneck Protector (IBProtector), a defense mechanism grounded in the information bottleneck principle, and we modify the objective to avoid trivial solutions. The IBProtector selectively compresses and perturbs prompts, facilitated by a lightweight and trainable extractor, preserving only essential information for the target LLMs to respond with the expected answer. Moreover, we further consider a situation where the gradient is not visible to be compatible with any LLM. Our empirical evaluations show that IBProtector outperforms current defense methods in mitigating jailbreak attempts, without overly affecting response quality or inference speed. Its effectiveness and adaptability across various attack methods and target LLMs underscore the potential of IBProtector as a novel, transferable defense that bolsters the security of LLMs without requiring modifications to the underlying models.
- [1812] arXiv:2404.14007 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Infusion: Preventing Customized Text-to-Image Diffusion from OverfittingComments: 10 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Text-to-image (T2I) customization aims to create images that embody specific visual concepts delineated in textual descriptions. However, existing works still face a main challenge, concept overfitting. To tackle this challenge, we first analyze overfitting, categorizing it into concept-agnostic overfitting, which undermines non-customized concept knowledge, and concept-specific overfitting, which is confined to customize on limited modalities, i.e, backgrounds, layouts, styles. To evaluate the overfitting degree, we further introduce two metrics, i.e, Latent Fisher divergence and Wasserstein metric to measure the distribution changes of non-customized and customized concept respectively. Drawing from the analysis, we propose Infusion, a T2I customization method that enables the learning of target concepts to avoid being constrained by limited training modalities, while preserving non-customized knowledge. Remarkably, Infusion achieves this feat with remarkable efficiency, requiring a mere 11KB of trained parameters. Extensive experiments also demonstrate that our approach outperforms state-of-the-art methods in both single and multi-concept customized generation.
- [1813] arXiv:2404.14061 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FedTAD: Topology-aware Data-free Knowledge Distillation for Subgraph Federated LearningComments: Accepted by IJCAI 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Databases (cs.DB); Social and Information Networks (cs.SI)
Abstract: Subgraph federated learning (subgraph-FL) is a new distributed paradigm that facilitates the collaborative training of graph neural networks (GNNs) by multi-client subgraphs. Unfortunately, a significant challenge of subgraph-FL arises from subgraph heterogeneity, which stems from node and topology variation, causing the impaired performance of the global GNN. Despite various studies, they have not yet thoroughly investigated the impact mechanism of subgraph heterogeneity. To this end, we decouple node and topology variation, revealing that they correspond to differences in label distribution and structure homophily. Remarkably, these variations lead to significant differences in the class-wise knowledge reliability of multiple local GNNs, misguiding the model aggregation with varying degrees. Building on this insight, we propose topology-aware data-free knowledge distillation technology (FedTAD), enhancing reliable knowledge transfer from the local model to the global model. Extensive experiments on six public datasets consistently demonstrate the superiority of FedTAD over state-of-the-art baselines.
- [1814] arXiv:2404.14073 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Robust Trajectory Representations: Isolating Environmental Confounders with Causal LearningComments: The paper has been accepted by IJCAI 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Trajectory modeling refers to characterizing human movement behavior, serving as a pivotal step in understanding mobility patterns. Nevertheless, existing studies typically ignore the confounding effects of geospatial context, leading to the acquisition of spurious correlations and limited generalization capabilities. To bridge this gap, we initially formulate a Structural Causal Model (SCM) to decipher the trajectory representation learning process from a causal perspective. Building upon the SCM, we further present a Trajectory modeling framework (TrajCL) based on Causal Learning, which leverages the backdoor adjustment theory as an intervention tool to eliminate the spurious correlations between geospatial context and trajectories. Extensive experiments on two real-world datasets verify that TrajCL markedly enhances performance in trajectory classification tasks while showcasing superior generalization and interpretability.
- [1815] arXiv:2404.14117 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Hierarchical localization with panoramic views and triplet loss functionsComments: This work has been submitted to the Artificial Intelligence Journal (Ed. Elsevier) for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The main objective of this paper is to address the mobile robot localization problem with Triplet Convolutional Neural Networks and test their robustness against changes of the lighting conditions. We have used omnidirectional images from real indoor environments captured in dynamic conditions that have been converted to panoramic format. Two approaches are proposed to address localization by means of triplet neural networks. First, hierarchical localization, which consists in estimating the robot position in two stages: a coarse localization, which involves a room retrieval task, and a fine localization is addressed by means of image retrieval in the previously selected room. Second, global localization, which consists in estimating the position of the robot inside the entire map in a unique step. Besides, an exhaustive study of the loss function influence on the network learning process has been made. The experimental section proves that triplet neural networks are an efficient and robust tool to address the localization of mobile robots in indoor environments, considering real operation conditions.
- [1816] arXiv:2404.14133 (cross-list from astro-ph.HE) [ pdf , ps , html , other ]
-
Title: Quantum Convolutional Neural Networks for the detection of Gamma-Ray Bursts in the AGILE space mission dataA. Rizzo , N. Parmiggiani , A. Bulgarelli , A. Macaluso , V. Fioretti , L. Castaldini , A. Di Piano , G. Panebianco , C. Pittori , M. Tavani , C. Sartori , C. Burigana , V. Cardone , F. Farsian , M. Meneghetti , G. Murante , R. Scaramella , F. Schillirò , V. Testa , T. TrombettiComments: 4 pages, 2 figures, proceedings of the ADASS XXXIII (2023) conference, to appear in ASP Conference SerieSubjects: High Energy Astrophysical Phenomena (astro-ph.HE) ; Artificial Intelligence (cs.AI)
Abstract: Quantum computing represents a cutting-edge frontier in artificial intelligence. It makes use of hybrid quantum-classical computation which tries to leverage quantum mechanic principles that allow us to use a different approach to deep learning classification problems. The work presented here falls within the context of the AGILE space mission, launched in 2007 by the Italian Space Agency. We implement different Quantum Convolutional Neural Networks (QCNN) that analyze data acquired by the instruments onboard AGILE to detect Gamma-Ray Bursts from sky maps or light curves. We use several frameworks such as TensorFlow-Quantum, Qiskit and PennyLane to simulate a quantum computer. We achieved an accuracy of 95.1% on sky maps with QCNNs, while the classical counterpart achieved 98.8% on the same data, using however hundreds of thousands more parameters.
- [1817] arXiv:2404.14161 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Multidimensional InterpolantsComments: 9 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In the domain of differential equation-based generative modeling, conventional approaches often rely on single-dimensional scalar values as interpolation coefficients during both training and inference phases. In this work, we introduce, for the first time, a multidimensional interpolant that extends these coefficients into multiple dimensions, leveraging the stochastic interpolant framework. Additionally, we propose a novel path optimization problem tailored to adaptively determine multidimensional inference trajectories, with a predetermined differential equation solver and a fixed number of function evaluations. Our solution involves simulation dynamics coupled with adversarial training to optimize the inference path. Notably, employing a multidimensional interpolant during training improves the model's inference performance, even in the absence of path optimization. When the adaptive, multidimensional path derived from our optimization process is employed, it yields further performance gains, even with fixed solver configurations. The introduction of multidimensional interpolants not only enhances the efficacy of models but also opens up a new domain for exploration in training and inference methodologies, emphasizing the potential of multidimensional paths as an untapped frontier.
- [1818] arXiv:2404.14198 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: BCFPL: Binary classification ConvNet based Fast Parking space recognition with Low resolution imageSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The automobile plays an important role in the economic activities of mankind, especially in the metropolis. Under the circumstances, the demand of quick search for available parking spaces has become a major concern for the automobile drivers. Meanwhile, the public sense of privacy is also awaking, the image-based parking space recognition methods lack the attention of privacy protection. In this paper, we proposed a binary convolutional neural network with lightweight design structure named BCFPL, which can be used to train with low-resolution parking space images and offer a reasonable recognition result. The images of parking space were collected from various complex environments, including different weather, occlusion conditions, and various camera angles. We conducted the training and testing progresses among different datasets and partial subsets. The experimental results show that the accuracy of BCFPL does not decrease compared with the original resolution image directly, and can reach the average level of the existing mainstream method. BCFPL also has low hardware requirements and fast recognition speed while meeting the privacy requirements, so it has application potential in intelligent city construction and automatic driving field.
- [1819] arXiv:2404.14219 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Phi-3 Technical Report: A Highly Capable Language Model Locally on Your PhoneMarah Abdin , Sam Ade Jacobs , Ammar Ahmad Awan , Jyoti Aneja , Ahmed Awadallah , Hany Awadalla , Nguyen Bach , Amit Bahree , Arash Bakhtiari , Harkirat Behl , Alon Benhaim , Misha Bilenko , Johan Bjorck , Sébastien Bubeck , Martin Cai , Caio César Teodoro Mendes , Weizhu Chen , Vishrav Chaudhary , Parul Chopra , Allie Del Giorno , Gustavo de Rosa , Matthew Dixon , Ronen Eldan , Dan Iter , Amit Garg , Abhishek Goswami , Suriya Gunasekar , Emman Haider , Junheng Hao , Russell J. Hewett , Jamie Huynh , Mojan Javaheripi , Xin Jin , Piero Kauffmann , Nikos Karampatziakis , Dongwoo Kim , Mahoud Khademi , Lev Kurilenko , James R. Lee , Yin Tat Lee , Yuanzhi Li , Chen Liang , Weishung Liu , Eric Lin , Zeqi Lin , Piyush Madan , Arindam Mitra , Hardik Modi , Anh Nguyen , Brandon Norick , Barun Patra , Daniel Perez-Becker , Thomas Portet , Reid Pryzant , Heyang Qin , Marko Radmilac , Corby Rosset , Sambudha Roy , Olatunji Ruwase , Olli Saarikivi , Amin Saied , Adil Salim , Michael Santacroce , Shital Shah , Ning Shang , Hiteshi Sharma , Xia Song , Masahiro Tanaka , Xin Wang , Rachel Ward , Guanhua Wang , Philipp Witte , Michael Wyatt , Can Xu , Jiahang Xu , Sonali Yadav , Fan Yang , Ziyi Yang , Donghan Yu , Chengruidong Zhang , Cyril Zhang , Jianwen Zhang , Li Lyna Zhang , Yi Zhang , Yue Zhang , Yunan Zhang , Xiren ZhouComments: 12 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with a 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench).
- [1820] arXiv:2404.14232 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Shifting Focus with HCEye: Exploring the Dynamics of Visual Highlighting and Cognitive Load on User Attention and Saliency PredictionComments: 18 pages, 9 Figures, Conference: ACM Symposium on Eye Tracking Research & Applications (ETRA); Journal: Proc. ACM Hum.-Comput. Interact., Vol. 8, No. ETRA, Article 236. Publication date: May 2024Journal-ref: Proc. ACM Hum.-Comput. Interact., Vol. 8, No. ETRA, Article 236. Publication date: May 2024Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Visual highlighting can guide user attention in complex interfaces. However, its effectiveness under limited attentional capacities is underexplored. This paper examines the joint impact of visual highlighting (permanent and dynamic) and dual-task-induced cognitive load on gaze behaviour. Our analysis, using eye-movement data from 27 participants viewing 150 unique webpages reveals that while participants' ability to attend to UI elements decreases with increasing cognitive load, dynamic adaptations (i.e., highlighting) remain attention-grabbing. The presence of these factors significantly alters what people attend to and thus what is salient. Accordingly, we show that state-of-the-art saliency models increase their performance when accounting for different cognitive loads. Our empirical insights, along with our openly available dataset, enhance our understanding of attentional processes in UIs under varying cognitive (and perceptual) loads and open the door for new models that can predict user attention while multitasking.
- [1821] arXiv:2404.14233 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI FeedbackWenyi Xiao , Ziwei Huang , Leilei Gan , Wanggui He , Haoyuan Li , Zhelun Yu , Hao Jiang , Fei Wu , Linchao ZhuSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: The rapidly developing Large Vision Language Models (LVLMs) have shown notable capabilities on a range of multi-modal tasks, but still face the hallucination phenomena where the generated texts do not align with the given contexts, significantly restricting the usages of LVLMs. Most previous work detects and mitigates hallucination at the coarse-grained level or requires expensive annotation (e.g., labeling by proprietary models or human experts). To address these issues, we propose detecting and mitigating hallucinations in LVLMs via fine-grained AI feedback. The basic idea is that we generate a small-size sentence-level hallucination annotation dataset by proprietary models, whereby we train a hallucination detection model which can perform sentence-level hallucination detection, covering primary hallucination types (i.e., object, attribute, and relationship). Then, we propose a detect-then-rewrite pipeline to automatically construct preference dataset for training hallucination mitigating model. Furthermore, we propose differentiating the severity of hallucinations, and introducing a Hallucination Severity-Aware Direct Preference Optimization (HSA-DPO) for mitigating hallucination in LVLMs by incorporating the severity of hallucinations into preference learning. Extensive experiments demonstrate the effectiveness of our method.
- [1822] arXiv:2404.14238 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: Beyond the Edge: An Advanced Exploration of Reinforcement Learning for Mobile Edge Computing, its Applications, and Future Research TrajectoriesComments: The paper is accepted by IEEE Communications Surveys and Tutorials (COMST)Subjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI)
Abstract: Mobile Edge Computing (MEC) broadens the scope of computation and storage beyond the central network, incorporating edge nodes close to end devices. This expansion facilitates the implementation of large-scale "connected things" within edge networks. The advent of applications necessitating real-time, high-quality service presents several challenges, such as low latency, high data rate, reliability, efficiency, and security, all of which demand resolution. The incorporation of reinforcement learning (RL) methodologies within MEC networks promotes a deeper understanding of mobile user behaviors and network dynamics, thereby optimizing resource use in computing and communication processes. This paper offers an exhaustive survey of RL applications in MEC networks, initially presenting an overview of RL from its fundamental principles to the latest advanced frameworks. Furthermore, it outlines various RL strategies employed in offloading, caching, and communication within MEC networks. Finally, it explores open issues linked with software and hardware platforms, representation, RL robustness, safe RL, large-scale scheduling, generalization, security, and privacy. The paper proposes specific RL techniques to mitigate these issues and provides insights into their practical applications.
- [1823] arXiv:2404.14240 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Collaborative Filtering Based on Diffusion Models: Unveiling the Potential of High-Order ConnectivityComments: 10 pages, 6 figures, 4 tables; 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024) (to appear) (Please cite our conference version.)Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract: A recent study has shown that diffusion models are well-suited for modeling the generative process of user-item interactions in recommender systems due to their denoising nature. However, existing diffusion model-based recommender systems do not explicitly leverage high-order connectivities that contain crucial collaborative signals for accurate recommendations. Addressing this gap, we propose CF-Diff, a new diffusion model-based collaborative filtering (CF) method, which is capable of making full use of collaborative signals along with multi-hop neighbors. Specifically, the forward-diffusion process adds random noise to user-item interactions, while the reverse-denoising process accommodates our own learning model, named cross-attention-guided multi-hop autoencoder (CAM-AE), to gradually recover the original user-item interactions. CAM-AE consists of two core modules: 1) the attention-aided AE module, responsible for precisely learning latent representations of user-item interactions while preserving the model's complexity at manageable levels, and 2) the multi-hop cross-attention module, which judiciously harnesses high-order connectivity information to capture enhanced collaborative signals. Through comprehensive experiments on three real-world datasets, we demonstrate that CF-Diff is (a) Superior: outperforming benchmark recommendation methods, achieving remarkable gains up to 7.29% compared to the best competitor, (b) Theoretically-validated: reducing computations while ensuring that the embeddings generated by our model closely approximate those from the original cross-attention, and (c) Scalable: proving the computational efficiency that scales linearly with the number of users or items.
- [1824] arXiv:2404.14241 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: UrbanCross: Enhancing Satellite Image-Text Retrieval with Cross-Domain AdaptationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Urbanization challenges underscore the necessity for effective satellite image-text retrieval methods to swiftly access specific information enriched with geographic semantics for urban applications. However, existing methods often overlook significant domain gaps across diverse urban landscapes, primarily focusing on enhancing retrieval performance within single domains. To tackle this issue, we present UrbanCross, a new framework for cross-domain satellite image-text retrieval. UrbanCross leverages a high-quality, cross-domain dataset enriched with extensive geo-tags from three countries to highlight domain diversity. It employs the Large Multimodal Model (LMM) for textual refinement and the Segment Anything Model (SAM) for visual augmentation, achieving a fine-grained alignment of images, segments and texts, yielding a 10% improvement in retrieval performance. Additionally, UrbanCross incorporates an adaptive curriculum-based source sampler and a weighted adversarial cross-domain fine-tuning module, progressively enhancing adaptability across various domains. Extensive experiments confirm UrbanCross's superior efficiency in retrieval and adaptation to new urban environments, demonstrating an average performance increase of 15% over its version without domain adaptation mechanisms, effectively bridging the domain gap.
- [1825] arXiv:2404.14243 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Turbo-CF: Matrix Decomposition-Free Graph Filtering for Fast RecommendationComments: 5 pages, 4 figures, 4 tables; 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024) (to appear) (Please cite our conference version.)Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract: A series of graph filtering (GF)-based collaborative filtering (CF) showcases state-of-the-art performance on the recommendation accuracy by using a low-pass filter (LPF) without a training process. However, conventional GF-based CF approaches mostly perform matrix decomposition on the item-item similarity graph to realize the ideal LPF, which results in a non-trivial computational cost and thus makes them less practical in scenarios where rapid recommendations are essential. In this paper, we propose Turbo-CF, a GF-based CF method that is both training-free and matrix decomposition-free. Turbo-CF employs a polynomial graph filter to circumvent the issue of expensive matrix decompositions, enabling us to make full use of modern computer hardware components (i.e., GPU). Specifically, Turbo-CF first constructs an item-item similarity graph whose edge weights are effectively regulated. Then, our own polynomial LPFs are designed to retain only low-frequency signals without explicit matrix decompositions. We demonstrate that Turbo-CF is extremely fast yet accurate, achieving a runtime of less than 1 second on real-world benchmark datasets while achieving recommendation accuracies comparable to best competitors.
- [1826] arXiv:2404.14244 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: AI-Generated Faces in the Real World: A Large-Scale Case Study of Twitter Profile ImagesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract: Recent advances in the field of generative artificial intelligence (AI) have blurred the lines between authentic and machine-generated content, making it almost impossible for humans to distinguish between such media. One notable consequence is the use of AI-generated images for fake profiles on social media. While several types of disinformation campaigns and similar incidents have been reported in the past, a systematic analysis has been lacking. In this work, we conduct the first large-scale investigation of the prevalence of AI-generated profile pictures on Twitter. We tackle the challenges of a real-world measurement study by carefully integrating various data sources and designing a multi-stage detection pipeline. Our analysis of nearly 15 million Twitter profile pictures shows that 0.052% were artificially generated, confirming their notable presence on the platform. We comprehensively examine the characteristics of these accounts and their tweet content, and uncover patterns of coordinated inauthentic behavior. The results also reveal several motives, including spamming and political amplification campaigns. Our research reaffirms the need for effective detection and mitigation strategies to cope with the potential negative effects of generative AI in the future.
- [1827] arXiv:2404.14285 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: LLM-Personalize: Aligning LLM Planners with Human Preferences via Reinforced Self-Training for Housekeeping RobotsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have shown significant potential for robotics applications, particularly task planning, by harnessing their language comprehension and text generation capabilities. However, in applications such as household robotics, a critical gap remains in the personalization of these models to individual user preferences. We introduce LLM-Personalize, a novel framework with an optimization pipeline designed to personalize LLM planners for household robotics. Our LLM-Personalize framework features an LLM planner that performs iterative planning in multi-room, partially-observable household scenarios, making use of a scene graph constructed with local observations. The generated plan consists of a sequence of high-level actions which are subsequently executed by a controller. Central to our approach is the optimization pipeline, which combines imitation learning and iterative self-training to personalize the LLM planner. In particular, the imitation learning phase performs initial LLM alignment from demonstrations, and bootstraps the model to facilitate effective iterative self-training, which further explores and aligns the model to user preferences. We evaluate LLM-Personalize on Housekeep, a challenging simulated real-world 3D benchmark for household rearrangements, and show that LLM-Personalize achieves more than a 30 percent increase in success rate over existing LLM planners, showcasing significantly improved alignment with human preferences. Project page: this https URL .
- [1828] arXiv:2404.14294 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Survey on Efficient Inference for Large Language ModelsZixuan Zhou , Xuefei Ning , Ke Hong , Tianyu Fu , Jiaming Xu , Shiyao Li , Yuming Lou , Luning Wang , Zhihang Yuan , Xiuhong Li , Shengen Yan , Guohao Dai , Xiao-Ping Zhang , Yuhan Dong , Yu WangSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed towards developing techniques aimed at enhancing the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of the inefficient LLM inference, i.e., the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach. Then, we introduce a comprehensive taxonomy that organizes the current literature into data-level, model-level, and system-level optimization. Moreover, the paper includes comparative experiments on representative methods within critical sub-fields to provide quantitative insights. Last but not least, we provide some knowledge summary and discuss future research directions.
- [1829] arXiv:2404.14296 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Does Your Neural Code Completion Model Use My Code? A Membership Inference ApproachSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Recent years have witnessed significant progress in developing deep learning-based models for automated code completion. Although using source code in GitHub has been a common practice for training deep-learning-based models for code completion, it may induce some legal and ethical issues such as copyright infringement. In this paper, we investigate the legal and ethical issues of current neural code completion models by answering the following question: Is my code used to train your neural code completion model? To this end, we tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks to a more challenging task of code completion. In particular, since the target code completion models perform as opaque black boxes, preventing access to their training data and parameters, we opt to train multiple shadow models to mimic their behavior. The acquired posteriors from these shadow models are subsequently employed to train a membership classifier. Subsequently, the membership classifier can be effectively employed to deduce the membership status of a given code sample based on the output of a target code completion model. We comprehensively evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models, (i.e., LSTM-based, CodeGPT, CodeGen, and StarCoder). Experimental results reveal that the LSTM-based and CodeGPT models suffer the membership leakage issue, which can be easily detected by our proposed membership inference approach with an accuracy of 0.842, and 0.730, respectively. Interestingly, our experiments also show that the data membership of current large language models of code, e.g., CodeGen and StarCoder, is difficult to detect, leaving amper space for further improvement. Finally, we also try to explain the findings from the perspective of model memorization.
- [1830] arXiv:2404.14325 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Adapting to time: why nature evolved a diverse set of neuronsComments: 13 pages, 6 figuresSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Abstract: Evolution has yielded a diverse set of neurons with varying morphologies and physiological properties that impact their processing of temporal information. In addition, it is known empirically that spike timing is a significant factor in neural computations. However, despite these two observations, most neural network models deal with spatially structured inputs with synchronous time steps, while restricting variation to parameters like weights and biases. In this study, we investigate the relevance of adapting temporal parameters, like time constants and delays, in feedforward networks that map spatio-temporal spike patterns. In this context, we show that networks with richer potential dynamics are able to more easily and robustly learn tasks with temporal structure. Indeed, when adaptation was restricted to weights, networks were unable to solve most problems. We also show strong interactions between the various parameters and the advantages of adapting temporal parameters when dealing with noise in inputs and weights, which might prove useful in neuromorphic hardware design.
- [1831] arXiv:2404.14332 (cross-list from hep-ex) [ pdf , ps , html , other ]
-
Title: Full Event Particle-Level Unfolding with Variable-Length Latent Variational DiffusionAlexander Shmakov , Kevin Greif , Michael James Fenton , Aishik Ghosh , Pierre Baldi , Daniel WhitesonComments: Submission to SciPostSubjects: High Energy Physics - Experiment (hep-ex) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
Abstract: The measurements performed by particle physics experiments must account for the imperfect response of the detectors used to observe the interactions. One approach, unfolding, statistically adjusts the experimental data for detector effects. Recently, generative machine learning models have shown promise for performing unbinned unfolding in a high number of dimensions. However, all current generative approaches are limited to unfolding a fixed set of observables, making them unable to perform full-event unfolding in the variable dimensional environment of collider data. A novel modification to the variational latent diffusion model (VLD) approach to generative unfolding is presented, which allows for unfolding of high- and variable-dimensional feature spaces. The performance of this method is evaluated in the context of semi-leptonic top quark pair production at the Large Hadron Collider.
- [1832] arXiv:2404.14349 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Automatic Discovery of Visual CircuitsComments: 14 pages, 11 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: To date, most discoveries of network subcomponents that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept. We introduce a new method for identifying these subgraphs: specifying a visual concept using a few examples, and then tracing the interdependence of neuron activations across layers, or their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
- [1833] arXiv:2404.14355 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Calc-CMU at SemEval-2024 Task 7: Pre-Calc -- Learning to Use the Calculator Improves Numeracy in Language ModelsComments: NumEval (Task 7) at SemEval, NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Quantitative and numerical comprehension in language is an important task in many fields like education and finance, but still remains a challenging task for language models. While tool and calculator usage has shown to be helpful to improve mathematical reasoning in large pretrained decoder-only language models, this remains unexplored for smaller language models with encoders. In this paper, we propose Pre-Calc, a simple pre-finetuning objective of learning to use the calculator for both encoder-only and encoder-decoder architectures, formulated as a discriminative and generative task respectively. We pre-train BERT and RoBERTa for discriminative calculator use and Flan-T5 for generative calculator use on the MAWPS, SVAMP, and AsDiv-A datasets, which improves performance on downstream tasks that require numerical understanding. Our code and data are available at this https URL .
- [1834] arXiv:2404.14368 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Graphic Design with Large Multimodal ModelSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: In the field of graphic design, automating the integration of design elements into a cohesive multi-layered artwork not only boosts productivity but also paves the way for the democratization of graphic design. One existing practice is Graphic Layout Generation (GLG), which aims to layout sequential design elements. It has been constrained by the necessity for a predefined correct sequence of layers, thus limiting creative potential and increasing user workload. In this paper, we present Hierarchical Layout Generation (HLG) as a more flexible and pragmatic setup, which creates graphic composition from unordered sets of design elements. To tackle the HLG task, we introduce Graphist, the first layout generation model based on large multimodal models. Graphist efficiently reframes the HLG as a sequence generation problem, utilizing RGB-A images as input, outputs a JSON draft protocol, indicating the coordinates, size, and order of each element. We develop new evaluation metrics for HLG. Graphist outperforms prior arts and establishes a strong baseline for this field. Project homepage: this https URL
- [1835] arXiv:2404.14370 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Assessing GPT-4-Vision's Capabilities in UML-Based Code GenerationSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Abstract: The emergence of advanced neural networks has opened up new ways in automated code generation from conceptual models, promising to enhance software development processes. This paper presents a preliminary evaluation of GPT-4-Vision, a state-of-the-art deep learning model, and its capabilities in transforming Unified Modeling Language (UML) class diagrams into fully operating Java class files. In our study, we used exported images of 18 class diagrams comprising 10 single-class and 8 multi-class diagrams. We used 3 different prompts for each input, and we manually evaluated the results. We created a scoring system in which we scored the occurrence of elements found in the diagram within the source code. On average, the model was able to generate source code for 88% of the elements shown in the diagrams. Our results indicate that GPT-4-Vision exhibits proficiency in handling single-class UML diagrams, successfully transforming them into syntactically correct class files. However, for multi-class UML diagrams, the model's performance is weaker compared to single-class diagrams. In summary, further investigations are necessary to exploit the model's potential completely.
- [1836] arXiv:2404.14372 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Beyond Scaling: Predicting Patent Approval with Domain-specific Fine-grained Claim Dependency GraphComments: 17 Pages, Under ReviewSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Model scaling is becoming the default choice for many language tasks due to the success of large language models (LLMs). However, it can fall short in specific scenarios where simple customized methods excel. In this paper, we delve into the patent approval pre-diction task and unveil that simple domain-specific graph methods outperform enlarging the model, using the intrinsic dependencies within the patent data. Specifically, we first extend the embedding-based state-of-the-art (SOTA) by scaling up its backbone model with various sizes of open-source LLMs, then explore prompt-based methods to harness proprietary LLMs' potential, but find the best results close to random guessing, underlining the ineffectiveness of model scaling-up. Hence, we propose a novel Fine-grained cLAim depeNdency (FLAN) Graph through meticulous patent data analyses, capturing the inherent dependencies across segments of the patent text. As it is model-agnostic, we apply cost-effective graph models to our FLAN Graph to obtain representations for approval prediction. Extensive experiments and detailed analyses prove that incorporating FLAN Graph via various graph models consistently outperforms all LLM baselines significantly. We hope that our observations and analyses in this paper can bring more attention to this challenging task and prompt further research into the limitations of LLMs. Our source code and dataset can be obtained from this http URL .
- [1837] arXiv:2404.14387 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Survey on Self-Evolution of Large Language ModelsZhengwei Tao , Ting-En Lin , Xiancai Chen , Hangyu Li , Yuchuan Wu , Yongbin Li , Zhi Jin , Fei Huang , Dacheng Tao , Jingren ZhouSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have significantly advanced in various fields and intelligent agent applications. However, current LLMs that learn from human or external model supervision are costly and may face performance ceilings as task complexity and diversity increase. To address this issue, self-evolution approaches that enable LLM to autonomously acquire, refine, and learn from experiences generated by the model itself are rapidly growing. This new training paradigm inspired by the human experiential learning process offers the potential to scale LLMs towards superintelligence. In this work, we present a comprehensive survey of self-evolution approaches in LLMs. We first propose a conceptual framework for self-evolution and outline the evolving process as iterative cycles composed of four phases: experience acquisition, experience refinement, updating, and evaluation. Second, we categorize the evolution objectives of LLMs and LLM-based agents; then, we summarize the literature and provide taxonomy and insights for each module. Lastly, we pinpoint existing challenges and propose future directions to improve self-evolution frameworks, equipping researchers with critical insights to fast-track the development of self-evolving LLMs.
- [1838] arXiv:2404.14395 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: PARAMANU-GANITA: Language Model with Mathematical CapabilitiesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In this paper, we present Paramanu-Ganita, a 208 million parameter novel Auto Regressive (AR) decoder based language model on mathematics. The model is pretrained from scratch at context size of 4096 on our curated mixed mathematical corpus. We evaluate our model on both perplexity metric and GSM8k mathematical benchmark. Paramanu-Ganita despite being 35 times smaller than 7B LLMs, outperformed generalist LLMs such as LLaMa-1 7B by 28.4% points, LLaMa-2 7B by 27.6% points, Falcon 7B by 32.6% points, PaLM 8B by 35.3% points, and math specialised LLMs such as Minerva 8B by 23.2% points, and LLEMMA-7B by 3.0% points in GSM8k test accuracy metric respectively. Paramanu-Ganita also outperformed giant LLMs like PaLM 62B by 6.4% points, Falcon 40B by 19.8% points, LLaMa-1 33B by 3.8% points and Vicuna 13B by 11.8% points respectively. The large significant margin improvement in performance of our math model over the existing LLMs signifies that reasoning capabilities of language model are just not restricted to LLMs with humongous number of parameters. Paramanu-Ganita took 146 hours of A100 training whereas math specialised LLM, LLEMMA 7B, was trained for 23,000 A100 hours of training equivalent. Thus, our approach of pretraining powerful domain specialised language models from scratch for domain adaptation is much more cost-effective than performing continual training of LLMs for domain adaptation. Hence, we conclude that for strong mathematical reasoning abilities of language model, we do not need giant LLMs and immense computing power to our end. In the end, we want to point out that we have only trained Paramanu-Ganita only on a part of our entire mathematical corpus and yet to explore the full potential of our model.
- [1839] arXiv:2404.14408 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SpaceByte: Towards Deleting Tokenization from Large Language ModelingComments: 9+9 pages, 3+1 figures, 2+4 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Tokenization is widely used in large language models because it significantly improves performance. However, tokenization imposes several disadvantages, such as performance biases, increased adversarial vulnerability, decreased character-level modeling performance, and increased modeling complexity. To address these disadvantages without sacrificing performance, we propose SpaceByte, a novel byte-level decoder architecture that closes the performance gap between byte-level and subword autoregressive language modeling. SpaceByte consists of a byte-level Transformer model, but with extra larger transformer blocks inserted in the middle of the layers. We find that performance is significantly improved by applying these larger blocks only after certain bytes, such as space characters, which typically denote word boundaries. Our experiments show that for a fixed training and inference compute budget, SpaceByte outperforms other byte-level architectures and roughly matches the performance of tokenized Transformer architectures.
- [1840] arXiv:2404.14416 (cross-list from physics.geo-ph) [ pdf , ps , html , other ]
-
Title: Conditional diffusion models for downscaling & bias correction of Earth system model precipitationSubjects: Geophysics (physics.geo-ph) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Abstract: Climate change exacerbates extreme weather events like heavy rainfall and flooding. As these events cause severe losses of property and lives, accurate high-resolution simulation of precipitation is imperative. However, existing Earth System Models (ESMs) struggle with resolving small-scale dynamics and suffer from biases, especially for extreme events. Traditional statistical bias correction and downscaling methods fall short in improving spatial structure, while recent deep learning methods lack controllability over the output and suffer from unstable training. Here, we propose a novel machine learning framework for simultaneous bias correction and downscaling. We train a generative diffusion model in a supervised way purely on observational data. We map observational and ESM data to a shared embedding space, where both are unbiased towards each other and train a conditional diffusion model to reverse the mapping. Our method can be used to correct any ESM field, as the training is independent of the ESM. Our approach ensures statistical fidelity, preserves large-scale spatial patterns and outperforms existing methods especially regarding extreme events and small-scale spatial features that are crucial for impact assessments.
- [1841] arXiv:2404.14418 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: Mitigating Cascading Effects in Large Adversarial Graph EnvironmentsComments: 10 pages, 7 figuresSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: A significant amount of society's infrastructure can be modeled using graph structures, from electric and communication grids, to traffic networks, to social networks. Each of these domains are also susceptible to the cascading spread of negative impacts, whether this be overloaded devices in the power grid or the reach of a social media post containing misinformation. The potential harm of a cascade is compounded when considering a malicious attack by an adversary that is intended to maximize the cascading impact. However, by exploiting knowledge of the cascading dynamics, targets with the largest cascading impact can be preemptively prioritized for defense, and the damage an adversary can inflict can be mitigated. While game theory provides tools for finding an optimal preemptive defense strategy, existing methods struggle to scale to the context of large graph environments because of the combinatorial explosion of possible actions that occurs when the attacker and defender can each choose multiple targets in the graph simultaneously. The proposed method enables a data-driven deep learning approach that uses multi-node representation learning and counterfactual data augmentation to generalize to the full combinatorial action space by training on a variety of small restricted subsets of the action space. We demonstrate through experiments that the proposed method is capable of identifying defense strategies that are less exploitable than SOTA methods for large graphs, while still being able to produce strategies near the Nash equilibrium for small-scale scenarios for which it can be computed. Moreover, the proposed method demonstrates superior prediction accuracy on a validation set of unseen cascades compared to other deep learning approaches.
- [1842] arXiv:2404.14432 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: Monitoring Critical Infrastructure Facilities During Disasters Using Large Language ModelsComments: Accepted to appear at the 2024 ISCRAM conferenceSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Abstract: Critical Infrastructure Facilities (CIFs), such as healthcare and transportation facilities, are vital for the functioning of a community, especially during large-scale emergencies. In this paper, we explore a potential application of Large Language Models (LLMs) to monitor the status of CIFs affected by natural disasters through information disseminated in social media networks. To this end, we analyze social media data from two disaster events in two different countries to identify reported impacts to CIFs as well as their impact severity and operational status. We employ state-of-the-art open-source LLMs to perform computational tasks including retrieval, classification, and inference, all in a zero-shot setting. Through extensive experimentation, we report the results of these tasks using standard evaluation metrics and reveal insights into the strengths and weaknesses of LLMs. We note that although LLMs perform well in classification tasks, they encounter challenges with inference tasks, especially when the context/prompt is complex and lengthy. Additionally, we outline various potential directions for future exploration that can be beneficial during the initial adoption phase of LLMs for disaster response tasks.
- [1843] arXiv:2404.14441 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Optimizing Contrail Detection: A Deep Learning Approach with EfficientNet-b4 EncodingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract: In the pursuit of environmental sustainability, the aviation industry faces the challenge of minimizing its ecological footprint. Among the key solutions is contrail avoidance, targeting the linear ice-crystal clouds produced by aircraft exhaust. These contrails exacerbate global warming by trapping atmospheric heat, necessitating precise segmentation and comprehensive analysis of contrail images to gauge their environmental impact. However, this segmentation task is complex due to the varying appearances of contrails under different atmospheric conditions and potential misalignment issues in predictive modeling. This paper presents an innovative deep-learning approach utilizing the efficient net-b4 encoder for feature extraction, seamlessly integrating misalignment correction, soft labeling, and pseudo-labeling techniques to enhance the accuracy and efficiency of contrail detection in satellite imagery. The proposed methodology aims to redefine contrail image analysis and contribute to the objectives of sustainable aviation by providing a robust framework for precise contrail detection and analysis in satellite imagery, thus aiding in the mitigation of aviation's environmental impact.
- [1844] arXiv:2404.14442 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Unified ODE Analysis of Smooth Q-Learning AlgorithmsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Convergence of Q-learning has been the focus of extensive research over the past several decades. Recently, an asymptotic convergence analysis for Q-learning was introduced using a switching system framework. This approach applies the so-called ordinary differential equation (ODE) approach to prove the convergence of the asynchronous Q-learning modeled as a continuous-time switching system, where notions from switching system theory are used to prove its asymptotic stability without using explicit Lyapunov arguments. However, to prove stability, restrictive conditions, such as quasi-monotonicity, must be satisfied for the underlying switching systems, which makes it hard to easily generalize the analysis method to other reinforcement learning algorithms, such as the smooth Q-learning variants. In this paper, we present a more general and unified convergence analysis that improves upon the switching system approach and can analyze Q-learning and its smooth variants. The proposed analysis is motivated by previous work on the convergence of synchronous Q-learning based on $p$-norm serving as a Lyapunov function. However, the proposed analysis addresses more general ODE models that can cover both asynchronous Q-learning and its smooth versions with simpler frameworks.
- [1845] arXiv:2404.14443 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Evaluation of Machine Translation Based on Semantic Dependencies and KeywordsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In view of the fact that most of the existing machine translation evaluation algorithms only consider the lexical and syntactic information, but ignore the deep semantic information contained in the sentence, this paper proposes a computational method for evaluating the semantic correctness of machine translations based on reference translations and incorporating semantic dependencies and sentence keyword information. Use the language technology platform developed by the Social Computing and Information Retrieval Research Center of Harbin Institute of Technology to conduct semantic dependency analysis and keyword analysis on sentences, and obtain semantic dependency graphs, keywords, and weight information corresponding to keywords. It includes all word information with semantic dependencies in the sentence and keyword information that affects semantic information. Construct semantic association pairs including word and dependency multi-features. The key semantics of the sentence cannot be highlighted in the semantic information extracted through semantic dependence, resulting in vague semantics analysis. Therefore, the sentence keyword information is also included in the scope of machine translation semantic evaluation. To achieve a comprehensive and in-depth evaluation of the semantic correctness of sentences, the experimental results show that the accuracy of the evaluation algorithm has been improved compared with similar methods, and it can more accurately measure the semantic correctness of machine translation.
- [1846] arXiv:2404.14444 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Practical Battery Health Monitoring using Uncertainty-Aware Bayesian Neural NetworkComments: 6 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Abstract: Battery health monitoring and prediction are critically important in the era of electric mobility with a huge impact on safety, sustainability, and economic aspects. Existing research often focuses on prediction accuracy but tends to neglect practical factors that may hinder the technology's deployment in real-world applications. In this paper, we address these practical considerations and develop models based on the Bayesian neural network for predicting battery end-of-life. Our models use sensor data related to battery health and apply distributions, rather than single-point, for each parameter of the models. This allows the models to capture the inherent randomness and uncertainty of battery health, which leads to not only accurate predictions but also quantifiable uncertainty. We conducted an experimental study and demonstrated the effectiveness of our proposed models, with a prediction error rate averaging 13.9%, and as low as 2.9% for certain tested batteries. Additionally, all predictions include quantifiable certainty, which improved by 66% from the initial to the mid-life stage of the battery. This research has practical values for battery technologies and contributes to accelerating the technology adoption in the industry.
- [1847] arXiv:2404.14445 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language ModelsComments: 10 pages, 1 figure, 4 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The rapid advancements in generative AI and large language models (LLMs) have opened up new avenues for producing synthetic data, particularly in the realm of structured tabular formats, such as product reviews. Despite the potential benefits, concerns regarding privacy leakage have surfaced, especially when personal information is utilized in the training datasets. In addition, there is an absence of a comprehensive evaluation framework capable of quantitatively measuring the quality of the generated synthetic data and their utility for downstream tasks. In response to this gap, we introduce SynEval, an open-source evaluation framework designed to assess the fidelity, utility, and privacy preservation of synthetically generated tabular data via a suite of diverse evaluation metrics. We validate the efficacy of our proposed framework - SynEval - by applying it to synthetic product review data generated by three state-of-the-art LLMs: ChatGPT, Claude, and Llama. Our experimental findings illuminate the trade-offs between various evaluation metrics in the context of synthetic data generation. Furthermore, SynEval stands as a critical instrument for researchers and practitioners engaged with synthetic tabular data,, empowering them to judiciously determine the suitability of the generated data for their specific applications, with an emphasis on upholding user privacy.
- [1848] arXiv:2404.14451 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Generative Subspace Adversarial Active Learning for Outlier Detection in Multiple Views of High-dimensional DataComments: 16 pages, Pre-printSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Outlier detection in high-dimensional tabular data is an important task in data mining, essential for many downstream tasks and applications. Existing unsupervised outlier detection algorithms face one or more problems, including inlier assumption (IA), curse of dimensionality (CD), and multiple views (MV). To address these issues, we introduce Generative Subspace Adversarial Active Learning (GSAAL), a novel approach that uses a Generative Adversarial Network with multiple adversaries. These adversaries learn the marginal class probability functions over different data subspaces, while a single generator in the full space models the entire distribution of the inlier class. GSAAL is specifically designed to address the MV limitation while also handling the IA and CD, being the only method to do so. We provide a comprehensive mathematical formulation of MV, convergence guarantees for the discriminators, and scalability results for GSAAL. Our extensive experiments demonstrate the effectiveness and scalability of GSAAL, highlighting its superior performance compared to other popular OD methods, especially in MV scenarios.
- [1849] arXiv:2404.14453 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: EPI-SQL: Enhancing Text-to-SQL Translation with Error-Prevention InstructionsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Databases (cs.DB)
Abstract: The conversion of natural language queries into SQL queries, known as Text-to-SQL, is a critical yet challenging task. This paper introduces EPI-SQL, a novel methodological framework leveraging Large Language Models (LLMs) to enhance the performance of Text-to-SQL tasks. EPI-SQL operates through a four-step process. Initially, the method involves gathering instances from the Spider dataset on which LLMs are prone to failure. These instances are then utilized to generate general error-prevention instructions (EPIs). Subsequently, LLMs craft contextualized EPIs tailored to the specific context of the current task. Finally, these context-specific EPIs are incorporated into the prompt used for SQL generation. EPI-SQL is distinguished in that it provides task-specific guidance, enabling the model to circumvent potential errors for the task at hand. Notably, the methodology rivals the performance of advanced few-shot methods despite being a zero-shot approach. An empirical assessment using the Spider benchmark reveals that EPI-SQL achieves an execution accuracy of 85.1\%, underscoring its effectiveness in generating accurate SQL queries through LLMs. The findings indicate a promising direction for future research, i.e. enhancing instructions with task-specific and contextualized rules, for boosting LLMs' performance in NLP tasks.
- [1850] arXiv:2404.14454 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Reinforcement of Explainability of ChatGPT Prompts by Embedding Breast Cancer Self-Screening Rules into AI ResponsesComments: 9 pages, 5 figures, 3 algorithms, 1 table, to be presented as a Poster at the ICCS'24Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Addressing the global challenge of breast cancer, this research explores the fusion of generative AI, focusing on ChatGPT 3.5 turbo model, and the intricacies of breast cancer risk assessment. The research aims to evaluate ChatGPT's reasoning capabilities, emphasizing its potential to process rules and provide explanations for screening recommendations. The study seeks to bridge the technology gap between intelligent machines and clinicians by demonstrating ChatGPT's unique proficiency in natural language reasoning. The methodology employs a supervised prompt-engineering approach to enforce detailed explanations for ChatGPT's recommendations. Synthetic use cases, generated algorithmically, serve as the testing ground for the encoded rules, evaluating the model's processing prowess. Findings highlight ChatGPT's promising capacity in processing rules comparable to Expert System Shells, with a focus on natural language reasoning. The research introduces the concept of reinforcement explainability, showcasing its potential in elucidating outcomes and facilitating user-friendly interfaces for breast cancer risk assessment.
- [1851] arXiv:2404.14455 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Neuro-Symbolic Explainer for Rare Events: A Case Study on Predictive MaintenanceComments: 26 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Predictive Maintenance applications are increasingly complex, with interactions between many components. Black box models are popular approaches based on deep learning techniques due to their predictive accuracy. This paper proposes a neural-symbolic architecture that uses an online rule-learning algorithm to explain when the black box model predicts failures. The proposed system solves two problems in parallel: anomaly detection and explanation of the anomaly. For the first problem, we use an unsupervised state of the art autoencoder. For the second problem, we train a rule learning system that learns a mapping from the input features to the autoencoder reconstruction error. Both systems run online and in parallel. The autoencoder signals an alarm for the examples with a reconstruction error that exceeds a threshold. The causes of the signal alarm are hard for humans to understand because they result from a non linear combination of sensor data. The rule that triggers that example describes the relationship between the input features and the autoencoder reconstruction error. The rule explains the failure signal by indicating which sensors contribute to the alarm and allowing the identification of the component involved in the failure. The system can present global explanations for the black box model and local explanations for why the black box model predicts a failure. We evaluate the proposed system in a real-world case study of Metro do Porto and provide explanations that illustrate its benefits.
- [1852] arXiv:2404.14459 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: LLMs in Web-Development: Evaluating LLM-Generated PHP code unveiling vulnerabilities and limitationsSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: This research carries out a comprehensive examination of web application code security, when generated by Large Language Models through analyzing a dataset comprising 2,500 small dynamic PHP websites. These AI-generated sites are scanned for security vulnerabilities after being deployed as standalone websites in Docker containers. The evaluation of the websites was conducted using a hybrid methodology, incorporating the Burp Suite active scanner, static analysis, and manual checks. Our investigation zeroes in on identifying and analyzing File Upload, SQL Injection, Stored XSS, and Reflected XSS. This approach not only underscores the potential security flaws within AI-generated PHP code but also provides a critical perspective on the reliability and security implications of deploying such code in real-world scenarios. Our evaluation confirms that 27% of the programs generated by GPT-4 verifiably contains vulnerabilities in the PHP code, where this number -- based on static scanning and manual verification -- is potentially much higher. This poses a substantial risks to software safety and security. In an effort to contribute to the research community and foster further analysis, we have made the source codes publicly available, alongside a record enumerating the detected vulnerabilities for each sample. This study not only sheds light on the security aspects of AI-generated code but also underscores the critical need for rigorous testing and evaluation of such technologies for software development.
- [1853] arXiv:2404.14461 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMsJavier Rando , Francesco Croce , Kryštof Mitka , Stepan Shabalin , Maksym Andriushchenko , Nicolas Flammarion , Florian TramèrComments: Competition ReportSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract: Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities. However, previous work has shown that the alignment process is vulnerable to poisoning attacks. Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command: adding the backdoor string to any prompt enables harmful responses from models that, otherwise, behave safely. Our competition, co-located at IEEE SaTML 2024, challenged participants to find universal backdoors in several large language models. This report summarizes the key findings and promising ideas for future research.
- [1854] arXiv:2404.14463 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DAIC-WOZ: On the Validity of Using the Therapist's prompts in Automatic Depression Detection from Clinical InterviewsSergio Burdisso , Ernesto Reyes-Ramírez , Esaú Villatoro-Tello , Fernando Sánchez-Vega , Pastor López-Monroy , Petr MotlicekComments: Accepted to Clinical NLP workshop at NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Automatic depression detection from conversational data has gained significant interest in recent years. The DAIC-WOZ dataset, interviews conducted by a human-controlled virtual agent, has been widely used for this task. Recent studies have reported enhanced performance when incorporating interviewer's prompts into the model. In this work, we hypothesize that this improvement might be mainly due to a bias present in these prompts, rather than the proposed architectures and methods. Through ablation experiments and qualitative analysis, we discover that models using interviewer's prompts learn to focus on a specific region of the interviews, where questions about past experiences with mental health issues are asked, and use them as discriminative shortcuts to detect depressed participants. In contrast, models using participant responses gather evidence from across the entire interview. Finally, to highlight the magnitude of this bias, we achieve a 0.90 F1 score by intentionally exploiting it, the highest result reported to date on this dataset using only textual information. Our findings underline the need for caution when incorporating interviewers' prompts into models, as they may inadvertently learn to exploit targeted prompts, rather than learning to characterize the language and behavior that are genuinely indicative of the patient's mental health condition.
- [1855] arXiv:2404.14464 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Tree of Reviews: A Tree-based Dynamic Iterative Retrieval Framework for Multi-hop Question AnsweringComments: Keywords: Muti-hop Question Answering; Retrieval-Augmented Generation; Tree of Thought; Reasoning TLDR: We proposed a tree-based dynamic, iterative retrieval framework for multi-hop question answeringSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Multi-hop question answering is a knowledge-intensive complex problem. Large Language Models (LLMs) use their Chain of Thoughts (CoT) capability to reason complex problems step by step, and retrieval-augmentation can effectively alleviate factual errors caused by outdated and unknown knowledge in LLMs. Recent works have introduced retrieval-augmentation in the CoT reasoning to solve multi-hop question answering. However, these chain methods have the following problems: 1) Retrieved irrelevant paragraphs may mislead the reasoning; 2) An error in the chain structure may lead to a cascade of errors.
In this paper, we propose a dynamic retrieval framework called Tree of Reviews (ToR), where the root node is the question, and the other nodes are paragraphs from retrieval, extending different reasoning paths from the root node to other nodes. Our framework dynamically decides to initiate a new search, reject, or accept based on the paragraphs on the reasoning paths. Compared to related work, we introduce a tree structure to handle each retrieved paragraph separately, alleviating the misleading effect of irrelevant paragraphs on the reasoning path; the diversity of reasoning path extension reduces the impact of a single reasoning error on the whole. We conducted experiments on three different multi-hop question answering datasets. The results show that compared to the baseline methods, ToR achieves state-of-the-art performance in both retrieval and response generation. In addition, we propose two tree-based search optimization strategies, pruning and effective expansion, to reduce time overhead and increase the diversity of path extension. We will release our code. - [1856] arXiv:2404.14465 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Benchmarking Advanced Text Anonymisation Methods: A Comparative Study on Novel and Traditional ApproachesDimitris Asimopoulos , Ilias Siniosoglou , Vasileios Argyriou , Thomai Karamitsou , Eleftherios Fountoukidis , Sotirios K. Goudos , Ioannis D. Moscholios , Konstantinos E. Psannis , Panagiotis SarigiannidisSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: In the realm of data privacy, the ability to effectively anonymise text is paramount. With the proliferation of deep learning and, in particular, transformer architectures, there is a burgeoning interest in leveraging these advanced models for text anonymisation tasks. This paper presents a comprehensive benchmarking study comparing the performance of transformer-based models and Large Language Models(LLM) against traditional architectures for text anonymisation. Utilising the CoNLL-2003 dataset, known for its robustness and diversity, we evaluate several models. Our results showcase the strengths and weaknesses of each approach, offering a clear perspective on the efficacy of modern versus traditional methods. Notably, while modern models exhibit advanced capabilities in capturing con textual nuances, certain traditional architectures still keep high performance. This work aims to guide researchers in selecting the most suitable model for their anonymisation needs, while also shedding light on potential paths for future advancements in the field.
- [1857] arXiv:2404.14467 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Integrating Chemistry Knowledge in Large Language Models via Prompt EngineeringComments: 43 pages, 17 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents a study on the integration of domain-specific knowledge in prompt engineering to enhance the performance of large language models (LLMs) in scientific domains. A benchmark dataset is curated to encapsulate the intricate physical-chemical properties of small molecules, their drugability for pharmacology, alongside the functional attributes of enzymes and crystal materials, underscoring the relevance and applicability across biological and chemical domains.The proposed domain-knowledge embedded prompt engineering method outperforms traditional prompt engineering strategies on various metrics, including capability, accuracy, F1 score, and hallucination drop. The effectiveness of the method is demonstrated through case studies on complex materials including the MacMillan catalyst, paclitaxel, and lithium cobalt oxide. The results suggest that domain-knowledge prompts can guide LLMs to generate more accurate and relevant responses, highlighting the potential of LLMs as powerful tools for scientific discovery and innovation when equipped with domain-specific prompts. The study also discusses limitations and future directions for domain-specific prompt engineering development.
- [1858] arXiv:2404.14469 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SnapKV: LLM Knows What You are Looking for Before GenerationYuhong Li , Yingbing Huang , Bowen Yang , Bharat Venkitesh , Acyr Locatelli , Hanchen Ye , Tianle Cai , Patrick Lewis , Deming ChenSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications.
We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an `observation' window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to baseline when processing inputs of 16K tokens. At the same time, it maintains comparable performance to baseline models across 16 long sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications. - [1859] arXiv:2404.14547 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Integrating Disambiguation and User Preferences into Large Language Models for Robot Motion PlanningSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: This paper presents a framework that can interpret humans' navigation commands containing temporal elements and directly translate their natural language instructions into robot motion planning. Central to our framework is utilizing Large Language Models (LLMs). To enhance the reliability of LLMs in the framework and improve user experience, we propose methods to resolve the ambiguity in natural language instructions and capture user preferences. The process begins with an ambiguity classifier, identifying potential uncertainties in the instructions. Ambiguous statements trigger a GPT-4-based mechanism that generates clarifying questions, incorporating user responses for disambiguation. Also, the framework assesses and records user preferences for non-ambiguous instructions, enhancing future interactions. The last part of this process is the translation of disambiguated instructions into a robot motion plan using Linear Temporal Logic. This paper details the development of this framework and the evaluation of its performance in various test scenarios.
- [1860] arXiv:2404.14552 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Generalizing Multi-Step Inverse Models for Representation Learning to Finite-Memory POMDPsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Discovering an informative, or agent-centric, state representation that encodes only the relevant information while discarding the irrelevant is a key challenge towards scaling reinforcement learning algorithms and efficiently applying them to downstream tasks. Prior works studied this problem in high-dimensional Markovian environments, when the current observation may be a complex object but is sufficient to decode the informative state. In this work, we consider the problem of discovering the agent-centric state in the more challenging high-dimensional non-Markovian setting, when the state can be decoded from a sequence of past observations. We establish that generalized inverse models can be adapted for learning agent-centric state representation for this task. Our results include asymptotic theory in the deterministic dynamics setting as well as counter-examples for alternative intuitive algorithms. We complement these findings with a thorough empirical study on the agent-centric state discovery abilities of the different alternatives we put forward. Particularly notable is our analysis of past actions, where we show that these can be a double-edged sword: making the algorithms more successful when used correctly and causing dramatic failure when used incorrectly.
- [1861] arXiv:2404.14575 (cross-list from cs.HC) [ pdf , ps , other ]
-
Title: Designing forecasting software for forecast users: Empowering non-experts to create and understand their own forecastsRichard Stromer (1), Oskar Triebe (1), Chad Zanocco (1), Ram Rajagopal (1) ((1) Stanford University)Comments: 10 pagesJournal-ref: AMCIS 2023 Proceedings 1Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Forecasts inform decision-making in nearly every domain. Forecasts are often produced by experts with rare or hard to acquire skills. In practice, forecasts are often used by domain experts and managers with little forecasting expertise. Our study focuses on how to design forecasting software that empowers non-expert users. We study how users can make use of state-of-the-art forecasting methods, embed their domain knowledge, and how they build understanding and trust towards generated forecasts. To do so, we co-designed a forecasting software prototype using feedback from users and then analyzed their interactions with our prototype. Our results identified three main considerations for non-expert users: (1) a safe stepwise approach facilitating causal understanding and trust; (2) a white box model supporting human-reasoning-friendly components; (3) the inclusion of domain knowledge. This paper contributes insights into how non-expert users interact with forecasting software and by recommending ways to design more accessible forecasting software.
- [1862] arXiv:2404.14581 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: The Adversarial AI-Art: Understanding, Generation, Detection, and BenchmarkingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: Generative AI models can produce high-quality images based on text prompts. The generated images often appear indistinguishable from images generated by conventional optical photography devices or created by human artists (i.e., real images). While the outstanding performance of such generative models is generally well received, security concerns arise. For instance, such image generators could be used to facilitate fraud or scam schemes, generate and spread misinformation, or produce fabricated artworks. In this paper, we present a systematic attempt at understanding and detecting AI-generated images (AI-art) in adversarial scenarios. First, we collect and share a dataset of real images and their corresponding artificial counterparts generated by four popular AI image generators. The dataset, named ARIA, contains over 140K images in five categories: artworks (painting), social media images, news photos, disaster scenes, and anime pictures. This dataset can be used as a foundation to support future research on adversarial AI-art. Next, we present a user study that employs the ARIA dataset to evaluate if real-world users can distinguish with or without reference images. In a benchmarking study, we further evaluate if state-of-the-art open-source and commercial AI image detectors can effectively identify the images in the ARIA dataset. Finally, we present a ResNet-50 classifier and evaluate its accuracy and transferability on the ARIA dataset.
- [1863] arXiv:2404.14606 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Cross-Task Multi-Branch Vision Transformer for Facial Expression and Mask Wearing ClassificationJournal-ref: Journal of Computer Technology and Applied Mathematics, vol. 1, no. 1, Apr. 2024, pp. 46-53,Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: With wearing masks becoming a new cultural norm, facial expression recognition (FER) while taking masks into account has become a significant challenge. In this paper, we propose a unified multi-branch vision transformer for facial expression recognition and mask wearing classification tasks. Our approach extracts shared features for both tasks using a dual-branch architecture that obtains multi-scale feature representations. Furthermore, we propose a cross-task fusion phase that processes tokens for each task with separate branches, while exchanging information using a cross attention module. Our proposed framework reduces the overall complexity compared with using separate networks for both tasks by the simple yet effective cross-task fusion phase. Extensive experiments demonstrate that our proposed model performs better than or on par with different state-of-the-art methods on both facial expression recognition and facial mask wearing classification task.
- [1864] arXiv:2404.14618 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Hybrid LLM: Cost-Efficient and Quality-Aware Query RoutingDujian Ding , Ankur Mallick , Chi Wang , Robert Sim , Subhabrata Mukherjee , Victor Ruhle , Laks V.S. Lakshmanan , Ahmed Hassan AwadallahComments: Accepted to ICLR 2024 (main conference)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size, while smaller models that can be deployed on lower cost (e.g., edge) devices, tend to lag behind in terms of response quality. Therefore in this work we propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality. Our approach uses a router that assigns queries to the small or large model based on the predicted query difficulty and the desired quality level. The desired quality level can be tuned dynamically at test time to seamlessly trade quality for cost as per the scenario requirements. In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
- [1865] arXiv:2404.14619 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: OpenELM: An Efficient Language Model Family with Open Training and Inference FrameworkSachin Mehta , Mohammad Hossein Sekhavat , Qingqing Cao , Maxwell Horton , Yanzi Jin , Chenfan Sun , Iman Mirzadeh , Mahyar Najibi , Dmitry Belenko , Peter Zatloukal , Mohammad RastegariComments: Minor correctionsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks. To this end, we release OpenELM, a state-of-the-art open language model. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. For example, with a parameter budget of approximately one billion parameters, OpenELM exhibits a 2.36% improvement in accuracy compared to OLMo while requiring $2\times$ fewer pre-training tokens.
Diverging from prior practices that only provide model weights and inference code, and pre-train on private datasets, our release includes the complete framework for training and evaluation of the language model on publicly available datasets, including training logs, multiple checkpoints, and pre-training configurations. We also release code to convert models to MLX library for inference and fine-tuning on Apple devices. This comprehensive release aims to empower and strengthen the open research community, paving the way for future open research endeavors.
Our source code along with pre-trained model weights and training recipes is available at \url{ this https URL }. Additionally, \model models can be found on HuggingFace at: \url{ this https URL }. - [1866] arXiv:2404.14635 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Digital Twins for forecasting and decision optimisation with machine learning: applications in wastewater treatmentComments: A bit thin, but an interesting application of ML methods for decision makingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Prediction and optimisation are two widely used techniques that have found many applications in solving real-world problems. While prediction is concerned with estimating the unknown future values of a variable, optimisation is concerned with optimising the decision given all the available data. These methods are used together to solve problems for sequential decision-making where often we need to predict the future values of variables and then use them for determining the optimal decisions. This paradigm is known as forecast and optimise and has numerous applications, e.g., forecast demand for a product and then optimise inventory, forecast energy demand and schedule generations, forecast demand for a service and schedule staff, to name a few. In this extended abstract, we review a digital twin that was developed and applied in wastewater treatment in Urban Utility to improve their operational efficiency. While the current study is tailored to the case study problem, the underlying principles can be used to solve similar problems in other domains.
- [1867] arXiv:2404.14646 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Exploring and Unleashing the Power of Large Language Models in Automated Code TranslationZhen Yang , Fang Liu , Zhongxing Yu , Jacky Wai Keung , Jia Li , Shuo Liu , Yifan Hong , Xiaoxue Ma , Zhi Jin , Ge LiComments: 23 pages, 7 figures, accepted by FSE'24 (2024 ACM International Conference on the Foundations of Software Engineering)Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Code translation tools are developed for automatic source-to-source translation. Although learning-based transpilers have shown impressive enhancement against rule-based counterparts, owing to their task-specific pre-training on extensive monolingual corpora. Their current performance still remains unsatisfactory for practical deployment, and the associated training resources are also prohibitively expensive. LLMs pre-trained on huge amounts of human-written code/text have shown remarkable performance in many code intelligence tasks due to their powerful generality, even without task-specific training. Thus, LLMs can potentially circumvent the above limitations, but they have not been exhaustively explored yet. This paper investigates diverse LLMs and learning-based transpilers for automated code translation tasks, finding that: although certain LLMs have outperformed current transpilers, they still have some accuracy issues, where most of the failures are induced by a lack of comprehension of source programs (38.51%), missing clear instructions on I/O types in translation (14.94%), and ignoring discrepancies between source and target programs (41.38%). Enlightened by the above findings, we propose UniTrans, an Unified code Translation framework, applicable to various LLMs, for unleashing their power in this field. Specifically, UniTrans first craft a series of test cases for target programs with the assistance of source programs. Next, it harnesses the above auto-generated test cases to augment the code translation and then evaluate their correctness via execution. Afterward, UniTrans further (iteratively) repairs incorrectly translated programs prompted by test case execution results. Extensive experiments are conducted on six translation datasets between Python, Java, and C++. Three recent LLMs of diverse sizes are tested with UniTrans, and all achieve substantial improvements.
- [1868] arXiv:2404.14660 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: AI Procurement Checklists: Revisiting Implementation in the Age of AI GovernanceSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Public sector use of AI has been quietly on the rise for the past decade, but only recently have efforts to regulate it entered the cultural zeitgeist. While simple to articulate, promoting ethical and effective roll outs of AI systems in government is a notoriously elusive task. On the one hand there are hard-to-address pitfalls associated with AI-based tools, including concerns about bias towards marginalized communities, safety, and gameability. On the other, there is pressure not to make it too difficult to adopt AI, especially in the public sector which typically has fewer resources than the private sector$\unicode{x2014}$conserving scarce government resources is often the draw of using AI-based tools in the first place. These tensions create a real risk that procedures built to ensure marginalized groups are not hurt by government use of AI will, in practice, be performative and ineffective. To inform the latest wave of regulatory efforts in the United States, we look to jurisdictions with mature regulations around government AI use. We report on lessons learned by officials in Brazil, Singapore and Canada, who have collectively implemented risk categories, disclosure requirements and assessments into the way they procure AI tools. In particular, we investigate two implemented checklists: the Canadian Directive on Automated Decision-Making (CDADM) and the World Economic Forum's AI Procurement in a Box (WEF). We detail three key pitfalls around expertise, risk frameworks and transparency, that can decrease the efficacy of regulations aimed at government AI use and suggest avenues for improvement.
- [1869] arXiv:2404.14664 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Employing Layerwised Unsupervised Learning to Lessen Data and Loss Requirements in Forward-Forward AlgorithmsComments: 8 pages, 6 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Recent deep learning models such as ChatGPT utilizing the back-propagation algorithm have exhibited remarkable performance. However, the disparity between the biological brain processes and the back-propagation algorithm has been noted. The Forward-Forward algorithm, which trains deep learning models solely through the forward pass, has emerged to address this. Although the Forward-Forward algorithm cannot replace back-propagation due to limitations such as having to use special input and loss functions, it has the potential to be useful in special situations where back-propagation is difficult to use. To work around this limitation and verify usability, we propose an Unsupervised Forward-Forward algorithm. Using an unsupervised learning model enables training with usual loss functions and inputs without restriction. Through this approach, we lead to stable learning and enable versatile utilization across various datasets and tasks. From a usability perspective, given the characteristics of the Forward-Forward algorithm and the advantages of the proposed method, we anticipate its practical application even in scenarios such as federated learning, where deep learning layers need to be trained separately in physically distributed environments.
- [1870] arXiv:2404.14674 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: HOIN: High-Order Implicit Neural RepresentationsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Abstract: Implicit neural representations (INR) suffer from worsening spectral bias, which results in overly smooth solutions to the inverse problem. To deal with this problem, we propose a universal framework for processing inverse problems called \textbf{High-Order Implicit Neural Representations (HOIN)}. By refining the traditional cascade structure to foster high-order interactions among features, HOIN enhances the model's expressive power and mitigates spectral bias through its neural tangent kernel's (NTK) strong diagonal properties, accelerating and optimizing inverse problem resolution. By analyzing the model's expression space, high-order derivatives, and the NTK matrix, we theoretically validate the feasibility of HOIN. HOIN realizes 1 to 3 dB improvements in most inverse problems, establishing a new state-of-the-art recovery quality and training efficiency, thus providing a new general paradigm for INR and paving the way for it to solve the inverse problem.
- [1871] arXiv:2404.14680 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Automated Multi-Language to English Machine Translation Using Generative Pre-Trained TransformersSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The task of accurate and efficient language translation is an extremely important information processing task. Machine learning enabled and automated translation that is accurate and fast is often a large topic of interest in the machine learning and data science communities. In this study, we examine using local Generative Pretrained Transformer (GPT) models to perform automated zero shot black-box, sentence wise, multi-natural-language translation into English text. We benchmark 16 different open-source GPT models, with no custom fine-tuning, from the Huggingface LLM repository for translating 50 different non-English languages into English using translated TED Talk transcripts as the reference dataset. These GPT model inference calls are performed strictly locally, on single A100 Nvidia GPUs. Benchmark metrics that are reported are language translation accuracy, using BLEU, GLEU, METEOR, and chrF text overlap measures, and wall-clock time for each sentence translation. The best overall performing GPT model for translating into English text for the BLEU metric is ReMM-v2-L2-13B with a mean score across all tested languages of $0.152$, for the GLEU metric is ReMM-v2-L2-13B with a mean score across all tested languages of $0.256$, for the chrF metric is Llama2-chat-AYT-13B with a mean score across all tested languages of $0.448$, and for the METEOR metric is ReMM-v2-L2-13B with a mean score across all tested languages of $0.438$.
- [1872] arXiv:2404.14687 (cross-list from cs.MM) [ pdf , ps , html , other ]
-
Title: Pegasus-v1 Technical ReportRaehyuk Jung , Hyojun Go , Jaehyuk Yi , Jiho Jang , Daniel Kim , Jay Suh , Aiden Lee , Cooper Han , Jae Lee , Jeff Kim , Jin-Young Kim , Junwan Kim , Kyle Park , Lucas Lee , Mars Ha , Minjoon Seo , Abraham Jo , Ed Park , Hassan Kianinejad , SJ Kim , Tony Moon , Wade Jeong , Andrei Popescu , Esther Kim , EK Yoon , Genie Heo , Henry Choi , Jenna Kang , Kevin Han , Noah Seo , Sunny Nguyen , Ryan Won , Yeonhoo Park , Anthony Giuliani , Dave Chung , Hans Yoon , James Le , Jenny Ahn , June Lee , Maninder Saini , Meredith Sanders , Soyoung Lee , Sue Kim , Travis CoutureSubjects: Multimedia (cs.MM) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language. Pegasus-1 is designed to address the unique challenges posed by video data, such as interpreting spatiotemporal information, to offer nuanced video content comprehension across various lengths. This technical report overviews Pegasus-1's architecture, training strategies, and its performance in benchmarks on video conversation, zero-shot video question answering, and video summarization. We also explore qualitative characteristics of Pegasus-1 , demonstrating its capabilities as well as its limitations, in order to provide readers a balanced view of its current state and its future direction.
- [1873] arXiv:2404.14688 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FMint: Bridging Human Designed and Data Pretrained Models for Differential Equation Foundation ModelSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
Abstract: Human-designed algorithms have long been fundamental in solving a variety of scientific and engineering challenges. Recently, data-driven deep learning methods have also risen to prominence, offering innovative solutions across numerous scientific fields. While traditional algorithms excel in capturing the core aspects of specific problems, they often lack the flexibility needed for varying problem conditions due to the absence of specific data. Conversely, while data-driven approaches utilize vast datasets, they frequently fall short in domain-specific knowledge. To bridge these gaps, we introduce \textbf{FMint} (Foundation Model based on Initialization), a generative pre-trained model that synergizes the precision of human-designed algorithms with the adaptability of data-driven methods. This model is specifically engineered for high-accuracy simulation of dynamical systems. Starting from initial trajectories provided by conventional methods, FMint quickly delivers highly accurate solutions. It incorporates in-context learning and has been pre-trained on a diverse corpus of 500,000 dynamical systems, showcasing exceptional generalization across a broad spectrum of real-world applications. By effectively combining algorithmic rigor with data-driven flexibility, FMint sets the stage for the next generation of scientific foundation models, tackling complex problems with both efficiency and high accuracy.
- [1874] arXiv:2404.14700 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: FlashSpeech: Efficient Zero-Shot Speech SynthesisZhen Ye , Zeqian Ju , Haohe Liu , Xu Tan , Jianyi Chen , Yiwen Lu , Peiwen Sun , Jiahao Pan , Weizhen Bian , Shulin He , Qifeng Liu , Yike Guo , Wei XueComments: Efficient zero-shot speech synthesisSubjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Abstract: Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5\% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found in this https URL .
- [1875] arXiv:2404.14712 (cross-list from physics.ao-ph) [ pdf , ps , html , other ]
-
Title: ORBIT: Oak Ridge Base Foundation Model for Earth System PredictabilityXiao Wang , Aristeidis Tsaris , Siyan Liu , Jong-Youl Choi , Ming Fan , Wei Zhang , Junqi Yin , Moetasim Ashfaq , Dan Lu , Prasanna BalaprakashSubjects: Atmospheric and Oceanic Physics (physics.ao-ph) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Image and Video Processing (eess.IV); Geophysics (physics.geo-ph)
Abstract: Earth system predictability is challenged by the complexity of environmental dynamics and the multitude of variables involved. Current AI foundation models, although advanced by leveraging large and heterogeneous data, are often constrained by their size and data integration, limiting their effectiveness in addressing the full range of Earth system prediction challenges. To overcome these limitations, we introduce the Oak Ridge Base Foundation Model for Earth System Predictability (ORBIT), an advanced vision-transformer model that scales up to 113 billion parameters using a novel hybrid tensor-data orthogonal parallelism technique. As the largest model of its kind, ORBIT surpasses the current climate AI foundation model size by a thousandfold. Performance scaling tests conducted on the Frontier supercomputer have demonstrated that ORBIT achieves 230 to 707 PFLOPS, with scaling efficiency maintained at 78% to 96% across 24,576 AMD GPUs. These breakthroughs establish new advances in AI-driven climate modeling and demonstrate promise to significantly improve the Earth system predictability.
- [1876] arXiv:2404.14716 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual ModalitiesComments: 16 pages, 6 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in dialogue history without any model parameter update. Despite such convenience, the performance of ICL heavily depends on the quality of the in-context examples presented, which makes the in-context example selection approach a critical choice. This paper proposes a novel Bayesian in-Context example Selection method (ByCS) for ICL. Extending the inference probability conditioned on in-context examples based on Bayes' theorem, ByCS focuses on the inverse inference conditioned on test input. Following the assumption that accurate inverse inference probability (likelihood) will result in accurate inference probability (posterior), in-context examples are selected based on their inverse inference results. Diverse and extensive cross-tasking and cross-modality experiments are performed with speech, text, and image examples. Experimental results show the efficacy and robustness of our ByCS method on various models, tasks and modalities.
- [1877] arXiv:2404.14736 (cross-list from cs.HC) [ pdf , ps , other ]
-
Title: Qualitative Approaches to Voice UXJournal-ref: ACM Computing Surveys (2024)Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: Voice is a natural mode of expression offered by modern computer-based systems. Qualitative perspectives on voice-based user experiences (voice UX) offer rich descriptions of complex interactions that numbers alone cannot fully represent. We conducted a systematic review of the literature on qualitative approaches to voice UX, capturing the nature of this body of work in a systematic map and offering a qualitative synthesis of findings. We highlight the benefits of qualitative methods for voice UX research, identify opportunities for increasing rigour in methods and outcomes, and distill patterns of experience across a diversity of devices and modes of qualitative praxis.
- [1878] arXiv:2404.14741 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Generate-on-Graph: Treat LLM as both Agent and KG in Incomplete Knowledge Graph Question AnsweringSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: To address the issue of insufficient knowledge and the tendency to generate hallucination in Large Language Models (LLMs), numerous studies have endeavored to integrate LLMs with Knowledge Graphs (KGs). However, all these methods are evaluated on conventional Knowledge Graph Question Answering (KGQA) with complete KGs, where the factual triples involved in each question are entirely covered by the given KG. In this situation, LLM mainly acts as an agent to find answer entities by exploring the KG, rather than effectively integrating internal and external knowledge sources. However, in real-world scenarios, KGs are often incomplete to cover all the knowledge required to answer questions. To simulate real-world scenarios and evaluate the ability of LLMs to integrate internal and external knowledge, in this paper, we propose leveraging LLMs for QA under Incomplete Knowledge Graph (IKGQA), where the given KG doesn't include all the factual triples involved in each question. To handle IKGQA, we propose a training-free method called Generate-on-Graph (GoG) that can generate new factual triples while exploring on KGs. Specifically, we propose a selecting-generating-answering framework, which not only treat the LLM as an agent to explore on KGs, but also treat it as a KG to generate new facts based on the explored subgraph and its inherent knowledge. Experimental results on two datasets demonstrate that our GoG can solve IKGQA to a certain extent, while almost all previous methods cannot perform well on IKGQA.
- [1879] arXiv:2404.14746 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A Customer Level Fraudulent Activity Detection Benchmark for Enhancing Machine Learning Model Research and EvaluationComments: 12 pages, 3 figures, 1 tableSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: In the field of fraud detection, the availability of comprehensive and privacy-compliant datasets is crucial for advancing machine learning research and developing effective anti-fraud systems. Traditional datasets often focus on transaction-level information, which, while useful, overlooks the broader context of customer behavior patterns that are essential for detecting sophisticated fraud schemes. The scarcity of such data, primarily due to privacy concerns, significantly hampers the development and testing of predictive models that can operate effectively at the customer level. Addressing this gap, our study introduces a benchmark that contains structured datasets specifically designed for customer-level fraud detection. The benchmark not only adheres to strict privacy guidelines to ensure user confidentiality but also provides a rich source of information by encapsulating customer-centric features. We have developed the benchmark that allows for the comprehensive evaluation of various machine learning models, facilitating a deeper understanding of their strengths and weaknesses in predicting fraudulent activities. Through this work, we seek to bridge the existing gap in data availability, offering researchers and practitioners a valuable resource that empowers the development of next-generation fraud detection techniques.
- [1880] arXiv:2404.14750 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Grounded Knowledge-Enhanced Medical VLP for Chest X-RayQiao Deng , Zhongzhen Huang , Yunqi Wang , Zhichuan Wang , Zhao Wang , Xiaofan Zhang , Qi Dou , Yeung Yu Hui , Edward S.HuiSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Medical vision-language pre-training has emerged as a promising approach for learning domain-general representations of medical image and text. Current algorithms that exploit the global and local alignment between medical image and text could however be marred by the redundant information in medical data. To address this issue, we propose a grounded knowledge-enhanced medical vision-language pre-training (GK-MVLP) framework for chest X-ray. In this framework, medical knowledge is grounded to the appropriate anatomical regions by using a transformer-based grounded knowledge-enhanced module for fine-grained alignment between anatomical region-level visual features and the textural features of medical knowledge. The performance of GK-MVLP is competitive with or exceeds the state of the art on downstream chest X-ray disease classification, disease localization, report generation, and medical visual question-answering tasks. Our results show the advantage of incorporating grounding mechanism to remove biases and improve the alignment between chest X-ray image and radiology report.
- [1881] arXiv:2404.14754 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Skip the Benchmark: Generating System-Level High-Level Synthesis Data using Generative Machine LearningComments: Accepted at Great Lakes Symposium on VLSI 2024 (GLSVLSI 24)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Abstract: High-Level Synthesis (HLS) Design Space Exploration (DSE) is a widely accepted approach for efficiently exploring Pareto-optimal and optimal hardware solutions during the HLS process. Several HLS benchmarks and datasets are available for the research community to evaluate their methodologies. Unfortunately, these resources are limited and may not be sufficient for complex, multi-component system-level explorations. Generating new data using existing HLS benchmarks can be cumbersome, given the expertise and time required to effectively generate data for different HLS designs and directives. As a result, synthetic data has been used in prior work to evaluate system-level HLS DSE. However, the fidelity of the synthetic data to real data is often unclear, leading to uncertainty about the quality of system-level HLS DSE. This paper proposes a novel approach, called Vaegan, that employs generative machine learning to generate synthetic data that is robust enough to support complex system-level HLS DSE experiments that would be unattainable with only the currently available data. We explore and adapt a Variational Autoencoder (VAE) and Generative Adversarial Network (GAN) for this task and evaluate our approach using state-of-the-art datasets and metrics. We compare our approach to prior works and show that Vaegan effectively generates synthetic HLS data that closely mirrors the ground truth's distribution.
- [1882] arXiv:2404.14755 (cross-list from cs.MM) [ pdf , ps , html , other ]
-
Title: SkinGEN: an Explainable Dermatology Diagnosis-to-Generation Framework with Interactive Vision-Language ModelsBo Lin , Yingjing Xu , Xuanwen Bao , Zhou Zhao , Zuyong Zhang , Zhouyang Wang , Jie Zhang , Shuiguang Deng , Jianwei YinSubjects: Multimedia (cs.MM) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Abstract: With the continuous advancement of vision language models (VLMs) technology, remarkable research achievements have emerged in the dermatology field, the fourth most prevalent human disease category. However, despite these advancements, VLM still faces "hallucination" in dermatological diagnosis, and due to the inherent complexity of dermatological conditions, existing tools offer relatively limited support for user comprehension. We propose SkinGEN, a diagnosis-to-generation framework that leverages the stable diffusion (SD) method to generate reference demonstrations from diagnosis results provided by VLM, thereby enhancing the visual explainability for users. Through extensive experiments with Low-Rank Adaptation (LoRA), we identify optimal strategies for skin condition image generation. We conduct a user study with 32 participants evaluating both the system performance and explainability. Results demonstrate that SkinGEN significantly improves users' comprehension of VLM predictions and fosters increased trust in the diagnostic process. This work paves the way for more transparent and user-centric VLM applications in dermatology and beyond.
- [1883] arXiv:2404.14757 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Integrating Mamba and Transformer for Long-Short Range Time Series ForecastingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Time series forecasting is an important problem and plays a key role in a variety of applications including weather forecasting, stock market, and scientific simulations. Although transformers have proven to be effective in capturing dependency, its quadratic complexity of attention mechanism prevents its further adoption in long-range time series forecasting, thus limiting them attend to short-range range. Recent progress on state space models (SSMs) have shown impressive performance on modeling long range dependency due to their subquadratic complexity. Mamba, as a representative SSM, enjoys linear time complexity and has achieved strong scalability on tasks that requires scaling to long sequences, such as language, audio, and genomics. In this paper, we propose to leverage a hybrid framework Mambaformer that internally combines Mamba for long-range dependency, and Transformer for short range dependency, for long-short range forecasting. To the best of our knowledge, this is the first paper to combine Mamba and Transformer architecture in time series data. We investigate possible hybrid architectures to combine Mamba layer and attention layer for long-short range time series forecasting. The comparative study shows that the Mambaformer family can outperform Mamba and Transformer in long-short range time series forecasting problem. The code is available at this https URL .
- [1884] arXiv:2404.14760 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Retrieval Augmented Generation for Domain-specific Question AnsweringSanat Sharma , David Seunghyun Yoon , Franck Dernoncourt , Dewang Sultania , Karishma Bagga , Mengjiao Zhang , Trung Bui , Varun KotteComments: AAAI 2024 (Association for the Advancement of Artificial Intelligence) Scientific Document Understanding WorkshopSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: Question answering (QA) has become an important application in the advanced development of large language models. General pre-trained large language models for question-answering are not trained to properly understand the knowledge or terminology for a specific domain, such as finance, healthcare, education, and customer service for a product. To better cater to domain-specific understanding, we build an in-house question-answering system for Adobe products. We propose a novel framework to compile a large question-answer database and develop the approach for retrieval-aware finetuning of a Large Language model. We showcase that fine-tuning the retriever leads to major improvements in the final generation. Our overall approach reduces hallucinations during generation while keeping in context the latest retrieval information for contextual grounding.
- [1885] arXiv:2404.14763 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Evolutionary Reinforcement Learning via Cooperative CoevolutionSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Recently, evolutionary reinforcement learning has obtained much attention in various domains. Maintaining a population of actors, evolutionary reinforcement learning utilises the collected experiences to improve the behaviour policy through efficient exploration. However, the poor scalability of genetic operators limits the efficiency of optimising high-dimensional neural networks. To address this issue, this paper proposes a novel cooperative coevolutionary reinforcement learning (CoERL) algorithm. Inspired by cooperative coevolution, CoERL periodically and adaptively decomposes the policy optimisation problem into multiple subproblems and evolves a population of neural networks for each of the subproblems. Instead of using genetic operators, CoERL directly searches for partial gradients to update the policy. Updating policy with partial gradients maintains consistency between the behaviour spaces of parents and offspring across generations. The experiences collected by the population are then used to improve the entire policy, which enhances the sampling efficiency. Experiments on six benchmark locomotion tasks demonstrate that CoERL outperforms seven state-of-the-art algorithms and baselines. Ablation study verifies the unique contribution of CoERL's core ingredients.
- [1886] arXiv:2404.14771 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Music Style Transfer With Diffusion ModelComments: 8 pages, 6 figures, ICMC 2023Journal-ref: International Computer Music Conference (ICMC 2023) pp. 40-47, October 2023Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI)
Abstract: Previous studies on music style transfer have mainly focused on one-to-one style conversion, which is relatively limited. When considering the conversion between multiple styles, previous methods required designing multiple modes to disentangle the complex style of the music, resulting in large computational costs and slow audio generation. The existing music style transfer methods generate spectrograms with artifacts, leading to significant noise in the generated audio. To address these issues, this study proposes a music style transfer framework based on diffusion models (DM) and uses spectrogram-based methods to achieve multi-to-multi music style transfer. The GuideDiff method is used to restore spectrograms to high-fidelity audio, accelerating audio generation speed and reducing noise in the generated audio. Experimental results show that our model has good performance in multi-mode music style transfer compared to the baseline and can generate high-quality audio in real-time on consumer-grade GPUs.
- [1887] arXiv:2404.14809 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Survey of Large Language Models on Generative Graph Analytics: Query, Learning, and ApplicationsComments: 31 pages including references, 22 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Databases (cs.DB)
Abstract: A graph is a fundamental data model to represent various entities and their complex relationships in society and nature, such as social networks, transportation networks, financial networks, and biomedical systems. Recently, large language models (LLMs) have showcased a strong generalization ability to handle various NLP and multi-mode tasks to answer users' arbitrary questions and specific-domain content generation. Compared with graph learning models, LLMs enjoy superior advantages in addressing the challenges of generalizing graph tasks by eliminating the need for training graph learning models and reducing the cost of manual annotation. In this survey, we conduct a comprehensive investigation of existing LLM studies on graph data, which summarizes the relevant graph analytics tasks solved by advanced LLM models and points out the existing remaining challenges and future directions. Specifically, we study the key problems of LLM-based generative graph analytics (LLM-GGA) with three categories: LLM-based graph query processing (LLM-GQP), LLM-based graph inference and learning (LLM-GIL), and graph-LLM-based applications. LLM-GQP focuses on an integration of graph analytics techniques and LLM prompts, including graph understanding and knowledge graph (KG) based augmented retrieval, while LLM-GIL focuses on learning and reasoning over graphs, including graph learning, graph-formed reasoning and graph representation. We summarize the useful prompts incorporated into LLM to handle different graph downstream tasks. Moreover, we give a summary of LLM model evaluation, benchmark datasets/tasks, and a deep pro and cons analysis of LLM models. We also explore open problems and future directions in this exciting interdisciplinary research area of LLMs and graph analytics.
- [1888] arXiv:2404.14822 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CNN2GNN: How to Bridge CNN with GNNSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Although the convolutional neural network (CNN) has achieved excellent performance in vision tasks by extracting the intra-sample representation, it will take a higher training expense because of stacking numerous convolutional layers. Recently, as the bilinear models, graph neural networks (GNN) have succeeded in exploring the underlying topological relationship among the graph data with a few graph neural layers. Unfortunately, it cannot be directly utilized on non-graph data due to the lack of graph structure and has high inference latency on large-scale scenarios. Inspired by these complementary strengths and weaknesses, \textit{we discuss a natural question, how to bridge these two heterogeneous networks?} In this paper, we propose a novel CNN2GNN framework to unify CNN and GNN together via distillation. Firstly, to break the limitations of GNN, a differentiable sparse graph learning module is designed as the head of networks to dynamically learn the graph for inductive learning. Then, a response-based distillation is introduced to transfer the knowledge from CNN to GNN and bridge these two heterogeneous networks. Notably, due to extracting the intra-sample representation of a single instance and the topological relationship among the datasets simultaneously, the performance of distilled ``boosted'' two-layer GNN on Mini-ImageNet is much higher than CNN containing dozens of layers such as ResNet152.
- [1889] arXiv:2404.14830 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CoProNN: Concept-based Prototypical Nearest Neighbors for Explaining Vision ModelsComments: 24 pages, 9 figures, 2 tables, accepted at WCXAI 2024 VallettaSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Mounting evidence in explainability for artificial intelligence (XAI) research suggests that good explanations should be tailored to individual tasks and should relate to concepts relevant to the task. However, building task specific explanations is time consuming and requires domain expertise which can be difficult to integrate into generic XAI methods. A promising approach towards designing useful task specific explanations with domain experts is based on compositionality of semantic concepts. Here, we present a novel approach that enables domain experts to quickly create concept-based explanations for computer vision tasks intuitively via natural language. Leveraging recent progress in deep generative methods we propose to generate visual concept-based prototypes via text-to-image methods. These prototypes are then used to explain predictions of computer vision models via a simple k-Nearest-Neighbors routine. The modular design of CoProNN is simple to implement, it is straightforward to adapt to novel tasks and allows for replacing the classification and text-to-image models as more powerful models are released. The approach can be evaluated offline against the ground-truth of predefined prototypes that can be easily communicated also to domain experts as they are based on visual concepts. We show that our strategy competes very well with other concept-based XAI approaches on coarse grained image classification tasks and may even outperform those methods on more demanding fine grained tasks. We demonstrate the effectiveness of our method for human-machine collaboration settings in qualitative and quantitative user studies. All code and experimental data can be found in our GitHub $\href{ this https URL }{repository}$.
- [1890] arXiv:2404.14851 (cross-list from cs.IR) [ pdf , ps , other ]
-
Title: From Matching to Generation: A Survey on Generative Information RetrievalSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Information Retrieval (IR) systems are crucial tools for users to access information, which have long been dominated by traditional methods relying on similarity matching. With the advancement of pre-trained language models, generative information retrieval (GenIR) emerges as a novel paradigm, attracting increasing attention. Currently, research in GenIR can be categorized into two aspects: generative document retrieval (GR) and reliable response generation. GR leverages the generative model's parameters for memorizing documents, enabling retrieval by directly generating relevant document identifiers without explicit indexing. Reliable response generation, on the other hand, employs language models to directly generate the information users seek, breaking the limitations of traditional IR in terms of document granularity and relevance matching, offering more flexibility, efficiency, and creativity, thus better meeting practical needs. This paper aims to systematically review the latest research progress in GenIR. We will summarize the advancements in GR regarding model training and structure, document identifier, incremental learning, etc., as well as progress in reliable response generation in aspects of internal knowledge memorization, external knowledge augmentation, etc. We also review the evaluation, challenges and future developments in GenIR systems. This review aims to offer a comprehensive reference for researchers, encouraging further development in the GenIR field.
- [1891] arXiv:2404.14871 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Exploring Human-AI Collaboration in Agile: Customised LLM Meeting AssistantsComments: International Conference on Agile Software Development (XP 2024), 14 pagesSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: This action research study focuses on the integration of "AI assistants" in two Agile software development meetings: the Daily Scrum and a feature refinement, a planning meeting that is part of an in-house Scaled Agile framework. We discuss the critical drivers of success, and establish a link between the use of AI and team collaboration dynamics. We conclude with a list of lessons learnt during the interventions in an industrial context, and provide a assessment checklist for companies and teams to reflect on their readiness level. This paper is thus a road-map to facilitate the integration of AI tools in Agile setups.
- [1892] arXiv:2404.14897 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Beyond the Speculative Game: A Survey of Speculative Execution in Large Language ModelsComments: 10 pages, 4 figures, 1 table, rejected from IJCAI 2024, revision in progressSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: With the increasingly giant scales of (causal) large language models (LLMs), the inference efficiency comes as one of the core concerns along the improved performance. In contrast to the memory footprint, the latency bottleneck seems to be of greater importance as there can be billions of requests to a LLM (e.g., GPT-4) per day. The bottleneck is mainly due to the autoregressive innateness of LLMs, where tokens can only be generated sequentially during decoding. To alleviate the bottleneck, the idea of speculative execution, which originates from the field of computer architecture, is introduced to LLM decoding in a \textit{draft-then-verify} style. Under this regime, a sequence of tokens will be drafted in a fast pace by utilizing some heuristics, and then the tokens shall be verified in parallel by the LLM. As the costly sequential inference is parallelized, LLM decoding speed can be significantly boosted. Driven by the success of LLMs in recent couple of years, a growing literature in this direction has emerged. Yet, there lacks a position survey to summarize the current landscape and draw a roadmap for future development of this promising area. To meet this demand, we present the very first survey paper that reviews and unifies literature of speculative execution in LLMs (e.g., blockwise parallel decoding, speculative decoding, etc.) in a comprehensive framework and a systematic taxonomy. Based on the taxonomy, we present a critical review and comparative analysis of the current arts. Finally we highlight various key challenges and future directions to further develop the area.
- [1893] arXiv:2404.14901 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Beyond Code Generation: An Observational Study of ChatGPT Usage in Software Engineering PracticeComments: Accepted at the ACM International Conference on the Foundations of Software Engineering (FSE) 2024Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs) are frequently discussed in academia and the general public as support tools for virtually any use case that relies on the production of text, including software engineering. Currently there is much debate, but little empirical evidence, regarding the practical usefulness of LLM-based tools such as ChatGPT for engineers in industry. We conduct an observational study of 24 professional software engineers who have been using ChatGPT over a period of one week in their jobs, and qualitatively analyse their dialogues with the chatbot as well as their overall experience (as captured by an exit survey). We find that, rather than expecting ChatGPT to generate ready-to-use software artifacts (e.g., code), practitioners more often use ChatGPT to receive guidance on how to solve their tasks or learn about a topic in more abstract terms. We also propose a theoretical framework for how (i) purpose of the interaction, (ii) internal factors (e.g., the user's personality), and (iii) external factors (e.g., company policy) together shape the experience (in terms of perceived usefulness and trust). We envision that our framework can be used by future research to further the academic discussion on LLM usage by software engineering practitioners, and to serve as a reference point for the design of future empirical LLM research in this domain.
- [1894] arXiv:2404.14906 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Driver Activity Classification Using Generalizable Representations from Vision-Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Driver activity classification is crucial for ensuring road safety, with applications ranging from driver assistance systems to autonomous vehicle control transitions. In this paper, we present a novel approach leveraging generalizable representations from vision-language models for driver activity classification. Our method employs a Semantic Representation Late Fusion Neural Network (SRLF-Net) to process synchronized video frames from multiple perspectives. Each frame is encoded using a pretrained vision-language encoder, and the resulting embeddings are fused to generate class probability predictions. By leveraging contrastively-learned vision-language representations, our approach achieves robust performance across diverse driver activities. We evaluate our method on the Naturalistic Driving Action Recognition Dataset, demonstrating strong accuracy across many classes. Our results suggest that vision-language representations offer a promising avenue for driver monitoring systems, providing both accuracy and interpretability through natural language descriptors.
- [1895] arXiv:2404.14928 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Graph Machine Learning in the Era of Large Language Models (LLMs)Wenqi Fan , Shijie Wang , Jiani Huang , Zhikai Chen , Yu Song , Wenzhuo Tang , Haitao Mao , Hui Liu , Xiaorui Liu , Dawei Yin , Qing LiSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Abstract: Graphs play an important role in representing complex relationships in various domains like social networks, knowledge graphs, and molecular discovery. With the advent of deep learning, Graph Neural Networks (GNNs) have emerged as a cornerstone in Graph Machine Learning (Graph ML), facilitating the representation and processing of graph structures. Recently, LLMs have demonstrated unprecedented capabilities in language tasks and are widely adopted in a variety of applications such as computer vision and recommender systems. This remarkable success has also attracted interest in applying LLMs to the graph domain. Increasing efforts have been made to explore the potential of LLMs in advancing Graph ML's generalization, transferability, and few-shot learning ability. Meanwhile, graphs, especially knowledge graphs, are rich in reliable factual knowledge, which can be utilized to enhance the reasoning capabilities of LLMs and potentially alleviate their limitations such as hallucinations and the lack of explainability. Given the rapid progress of this research direction, a systematic review summarizing the latest advancements for Graph ML in the era of LLMs is necessary to provide an in-depth understanding to researchers and practitioners. Therefore, in this survey, we first review the recent developments in Graph ML. We then explore how LLMs can be utilized to enhance the quality of graph features, alleviate the reliance on labeled data, and address challenges such as graph heterogeneity and out-of-distribution (OOD) generalization. Afterward, we delve into how graphs can enhance LLMs, highlighting their abilities to enhance LLM pre-training and inference. Furthermore, we investigate various applications and discuss the potential future directions in this promising field.
- [1896] arXiv:2404.14933 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Fin-Fed-OD: Federated Outlier Detection on Financial Tabular DataSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Anomaly detection in real-world scenarios poses challenges due to dynamic and often unknown anomaly distributions, requiring robust methods that operate under an open-world assumption. This challenge is exacerbated in practical settings, where models are employed by private organizations, precluding data sharing due to privacy and competitive concerns. Despite potential benefits, the sharing of anomaly information across organizations is restricted. This paper addresses the question of enhancing outlier detection within individual organizations without compromising data confidentiality. We propose a novel method leveraging representation learning and federated learning techniques to improve the detection of unknown anomalies. Specifically, our approach utilizes latent representations obtained from client-owned autoencoders to refine the decision boundary of inliers. Notably, only model parameters are shared between organizations, preserving data privacy. The efficacy of our proposed method is evaluated on two standard financial tabular datasets and an image dataset for anomaly detection in a distributed setting. The results demonstrate a strong improvement in the classification of unknown outliers during the inference phase for each organization's model.
- [1897] arXiv:2404.14941 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Delayed Bottlenecking: Alleviating Forgetting in Pre-trained Graph Neural NetworksZhe Zhao , Pengkun Wang , Xu Wang , Haibin Wen , Xiaolong Xie , Zhengyang Zhou , Qingfu Zhang , Yang WangSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Pre-training GNNs to extract transferable knowledge and apply it to downstream tasks has become the de facto standard of graph representation learning. Recent works focused on designing self-supervised pre-training tasks to extract useful and universal transferable knowledge from large-scale unlabeled data. However, they have to face an inevitable question: traditional pre-training strategies that aim at extracting useful information about pre-training tasks, may not extract all useful information about the downstream task. In this paper, we reexamine the pre-training process within traditional pre-training and fine-tuning frameworks from the perspective of Information Bottleneck (IB) and confirm that the forgetting phenomenon in pre-training phase may cause detrimental effects on downstream tasks. Therefore, we propose a novel \underline{D}elayed \underline{B}ottlenecking \underline{P}re-training (DBP) framework which maintains as much as possible mutual information between latent representations and training data during pre-training phase by suppressing the compression operation and delays the compression operation to fine-tuning phase to make sure the compression can be guided with labeled fine-tuning data and downstream tasks. To achieve this, we design two information control objectives that can be directly optimized and further integrate them into the actual model design. Extensive experiments on both chemistry and biology domains demonstrate the effectiveness of DBP.
- [1898] arXiv:2404.14952 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Leveraging Speech for Gesture Detection in Multimodal CommunicationEsam Ghaleb , Ilya Burenko , Marlou Rasenberg , Wim Pouw , Ivan Toni , Peter Uhrig , Anna Wilson , Judith Holler , Aslı Özyürek , Raquel FernándezSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system. An important task in gesture analysis is detecting a gesture's beginning and end. Research on automatic gesture detection has primarily focused on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals to detect gestures that co-occur with speech. This work addresses this gap by focusing on co-speech gesture detection, emphasising the synchrony between speech and co-speech hand gestures. We address three main challenges: the variability of gesture forms, the temporal misalignment between gesture and speech onsets, and differences in sampling rate between modalities. We investigate extended speech time windows and employ separate backbone models for each modality to address the temporal misalignment and sampling rate differences. We utilize Transformer encoders in cross-modal and early fusion techniques to effectively align and integrate speech and skeletal sequences. The study results show that combining visual and speech information significantly enhances gesture detection performance. Our findings indicate that expanding the speech buffer beyond visual time segments improves performance and that multimodal integration using cross-modal and early fusion techniques outperforms baseline methods using unimodal and late fusion methods. Additionally, we find a correlation between the models' gesture prediction confidence and low-level speech frequency features potentially associated with gestures. Overall, the study provides a better understanding and detection methods for co-speech gestures, facilitating the analysis of multimodal communication.
- [1899] arXiv:2404.14963 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better ReasonersComments: Work in progressSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Chain of Thought prompting strategy has enhanced the performance of Large Language Models (LLMs) across various NLP tasks. However, it still has shortcomings when dealing with complex reasoning tasks, including understanding errors, calculation errors and process errors (e.g., missing-step and hallucinations). Subsequently, our in-depth analyses among various error types show that deeply understanding the whole problem is critical in addressing complicated reasoning tasks. Motivated by this, we propose a simple-yet-effective method, namely Deeply Understanding the Problems (DUP), to enhance the LLMs' reasoning abilities. The core of our method is to encourage the LLMs to deeply understand the problems and leverage the key problem-solving information for better reasoning. Extensive experiments on 10 diverse reasoning benchmarks show that our DUP method consistently outperforms the other counterparts by a large margin. More encouragingly, DUP achieves a new SOTA result on the GSM8K benchmark, with an accuracy of 97.1% in a zero-shot setting.
- [1900] arXiv:2404.14966 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space ModelComments: 10 pages, 4 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSM), outperforms Transformer in multiple areas with only linear complexity. However, the straightforward adoption of Mamba does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. Additionally, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works in multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple SoTA, including an overall accuracy of 92.6% (train from scratch) on the ScanObjectNN and 95.1% (with single-modal pre-training) on the ModelNet40 classification task, with only linear complexity.
- [1901] arXiv:2404.14967 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CoARF: Controllable 3D Artistic Style Transfer for Radiance FieldsComments: International Conference on 3D Vision 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR)
Abstract: Creating artistic 3D scenes can be time-consuming and requires specialized knowledge. To address this, recent works such as ARF, use a radiance field-based approach with style constraints to generate 3D scenes that resemble a style image provided by the user. However, these methods lack fine-grained control over the resulting scenes. In this paper, we introduce Controllable Artistic Radiance Fields (CoARF), a novel algorithm for controllable 3D scene stylization. CoARF enables style transfer for specified objects, compositional 3D style transfer and semantic-aware style transfer. We achieve controllability using segmentation masks with different label-dependent loss functions. We also propose a semantic-aware nearest neighbor matching algorithm to improve the style transfer quality. Our extensive experiments demonstrate that CoARF provides user-specified controllability of style transfer and superior style transfer quality with more precise feature matching.
- [1902] arXiv:2404.14979 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SGFormer: Spherical Geometry Transformer for 360 Depth EstimationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Panoramic distortion poses a significant challenge in 360 depth estimation, particularly pronounced at the north and south poles. Existing methods either adopt a bi-projection fusion strategy to remove distortions or model long-range dependencies to capture global structures, which can result in either unclear structure or insufficient local perception. In this paper, we propose a spherical geometry transformer, named SGFormer, to address the above issues, with an innovative step to integrate spherical geometric priors into vision transformers. To this end, we retarget the transformer decoder to a spherical prior decoder (termed SPDecoder), which endeavors to uphold the integrity of spherical structures during decoding. Concretely, we leverage bipolar re-projection, circular rotation, and curve local embedding to preserve the spherical characteristics of equidistortion, continuity, and surface distance, respectively. Furthermore, we present a query-based global conditional position embedding to compensate for spatial structure at varying resolutions. It not only boosts the global perception of spatial position but also sharpens the depth structure across different patches. Finally, we conduct extensive experiments on popular benchmarks, demonstrating our superiority over state-of-the-art solutions.
- [1903] arXiv:2404.14986 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: $\texttt{MiniMol}$: A Parameter-Efficient Foundation Model for Molecular LearningKerstin Kläser , Błażej Banaszewski , Samuel Maddrell-Mander , Callum McLean , Luis Müller , Ali Parviz , Shenyang Huang , Andrew FitzgibbonSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In biological tasks, data is rarely plentiful as it is generated from hard-to-gather measurements. Therefore, pre-training foundation models on large quantities of available data and then transfer to low-data downstream tasks is a promising direction. However, how to design effective foundation models for molecular learning remains an open question, with existing approaches typically focusing on models with large parameter capacities. In this work, we propose $\texttt{MiniMol}$, a foundational model for molecular learning with 10 million parameters. $\texttt{MiniMol}$ is pre-trained on a mix of roughly 3300 sparsely defined graph- and node-level tasks of both quantum and biological nature. The pre-training dataset includes approximately 6 million molecules and 500 million labels. To demonstrate the generalizability of $\texttt{MiniMol}$ across tasks, we evaluate it on downstream tasks from the Therapeutic Data Commons (TDC) ADMET group showing significant improvements over the prior state-of-the-art foundation model across 17 tasks. $\texttt{MiniMol}$ will be a public and open-sourced model for future research.
- [1904] arXiv:2404.14994 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Transformers Can Represent $n$-gram Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Abstract: Plenty of existing work has analyzed the abilities of the transformer architecture by describing its representational capacity with formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are definitionally \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
- [1905] arXiv:2404.15022 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A review of deep learning-based information fusion techniques for multimodal medical image classificationYihao Li , Mostafa El Habib Daho , Pierre-Henri Conze , Rachid Zeghlache , Hugo Le Boité , Ramin Tadayoni , Béatrice Cochener , Mathieu Lamard , Gwenolé QuellecSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Multimodal medical imaging plays a pivotal role in clinical diagnosis and research, as it combines information from various imaging modalities to provide a more comprehensive understanding of the underlying pathology. Recently, deep learning-based multimodal fusion techniques have emerged as powerful tools for improving medical image classification. This review offers a thorough analysis of the developments in deep learning-based multimodal fusion for medical classification tasks. We explore the complementary relationships among prevalent clinical modalities and outline three main fusion schemes for multimodal classification networks: input fusion, intermediate fusion (encompassing single-level fusion, hierarchical fusion, and attention-based fusion), and output fusion. By evaluating the performance of these fusion techniques, we provide insight into the suitability of different network architectures for various multimodal fusion scenarios and application domains. Furthermore, we delve into challenges related to network architecture selection, handling incomplete multimodal data management, and the potential limitations of multimodal fusion. Finally, we spotlight the promising future of Transformer-based multimodal fusion techniques and give recommendations for future research in this rapidly evolving field.
- [1906] arXiv:2404.15034 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Deep Multi-View Channel-Wise Spatio-Temporal Network for Traffic Flow PredictionComments: Accepted by AAAI2020 workshopSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Accurately forecasting traffic flows is critically important to many real applications including public safety and intelligent transportation systems. The challenges of this problem include both the dynamic mobility patterns of the people and the complex spatial-temporal correlations of the urban traffic data. Meanwhile, most existing models ignore the diverse impacts of the various traffic observations (e.g. vehicle speed and road occupancy) on the traffic flow prediction, and different traffic observations can be considered as different channels of input features. We argue that the analysis in multiple-channel traffic observations might help to better address this problem. In this paper, we study the novel problem of multi-channel traffic flow prediction, and propose a deep \underline{M}ulti-\underline{V}iew \underline{C}hannel-wise \underline{S}patio-\underline{T}emporal \underline{Net}work (MVC-STNet) model to effectively address it. Specifically, we first construct the localized and globalized spatial graph where the multi-view fusion module is used to effectively extract the local and global spatial dependencies. Then LSTM is used to learn the temporal correlations. To effectively model the different impacts of various traffic observations on traffic flow prediction, a channel-wise graph convolutional network is also designed. Extensive experiments are conducted over the PEMS04 and PEMS08 datasets. The results demonstrate that the proposed MVC-STNet outperforms state-of-the-art methods by a large margin.
- [1907] arXiv:2404.15042 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Leverage Variational Graph Representation For Model Poisoning on Federated LearningComments: 12 pages, 8 figures, 2 tablesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: This paper puts forth a new training data-untethered model poisoning (MP) attack on federated learning (FL). The new MP attack extends an adversarial variational graph autoencoder (VGAE) to create malicious local models based solely on the benign local models overheard without any access to the training data of FL. Such an advancement leads to the VGAE-MP attack that is not only efficacious but also remains elusive to detection. VGAE-MP attack extracts graph structural correlations among the benign local models and the training data features, adversarially regenerates the graph structure, and generates malicious local models using the adversarial graph structure and benign models' features. Moreover, a new attacking algorithm is presented to train the malicious local models using VGAE and sub-gradient descent, while enabling an optimal selection of the benign local models for training the VGAE. Experiments demonstrate a gradual drop in FL accuracy under the proposed VGAE-MP attack and the ineffectiveness of existing defense mechanisms in detecting the attack, posing a severe threat to FL.
- [1908] arXiv:2404.15045 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Multi-Head Mixture-of-ExpertsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Sparse Mixtures of Experts (SMoE) scales model capacity without significant increases in training and inference costs, but exhibits the following two issues: (1) Low expert activation, where only a small subset of experts are activated for optimization. (2) Lacking fine-grained analytical capabilities for multiple semantic concepts within individual tokens. We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens. These sub-tokens are then assigned to and processed by a diverse set of experts in parallel, and seamlessly reintegrated into the original token form. The multi-head mechanism enables the model to collectively attend to information from various representation spaces within different experts, while significantly enhances expert activation, thus deepens context understanding and alleviate overfitting. Moreover, our MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance. Extensive experimental results across three tasks: English-focused language modeling, Multi-lingual language modeling and Masked multi-modality modeling tasks, demonstrate the effectiveness of MH-MoE.
- [1909] arXiv:2404.15058 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AISeliem El-Sayed , Canfer Akbulut , Amanda McCroskery , Geoff Keeling , Zachary Kenton , Zaria Jalan , Nahema Marchal , Arianna Manzini , Toby Shevlane , Shannon Vallor , Daniel Susser , Matija Franklin , Sophie Bridgers , Harry Law , Matthew Rahtz , Murray Shanahan , Michael Henry Tessler , Arthur Douillard , Tom Everitt , Sasha BrownSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Recent generative AI systems have demonstrated more advanced persuasive capabilities and are increasingly permeating areas of life where they can influence decision-making. Generative AI presents a new risk profile of persuasion due the opportunity for reciprocal exchange and prolonged interactions. This has led to growing concerns about harms from AI persuasion and how they can be mitigated, highlighting the need for a systematic study of AI persuasion. The current definitions of AI persuasion are unclear and related harms are insufficiently studied. Existing harm mitigation approaches prioritise harms from the outcome of persuasion over harms from the process of persuasion. In this paper, we lay the groundwork for the systematic study of AI persuasion. We first put forward definitions of persuasive generative AI. We distinguish between rationally persuasive generative AI, which relies on providing relevant facts, sound reasoning, or other forms of trustworthy evidence, and manipulative generative AI, which relies on taking advantage of cognitive biases and heuristics or misrepresenting information. We also put forward a map of harms from AI persuasion, including definitions and examples of economic, physical, environmental, psychological, sociocultural, political, privacy, and autonomy harm. We then introduce a map of mechanisms that contribute to harmful persuasion. Lastly, we provide an overview of approaches that can be used to mitigate against process harms of persuasion, including prompt engineering for manipulation classification and red teaming. Future work will operationalise these mitigations and study the interaction between different types of mechanisms of persuasion.
- [1910] arXiv:2404.15065 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Formal Verification of Graph Convolutional Networks with Uncertain Node Features and Uncertain Graph StructureComments: under reviewSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graph neural networks are becoming increasingly popular in the field of machine learning due to their unique ability to process data structured in graphs. They have also been applied in safety-critical environments where perturbations inherently occur. However, these perturbations require us to formally verify neural networks before their deployment in safety-critical environments as neural networks are prone to adversarial attacks. While there exists research on the formal verification of neural networks, there is no work verifying the robustness of generic graph convolutional network architectures with uncertainty in the node features and in the graph structure over multiple message-passing steps. This work addresses this research gap by explicitly preserving the non-convex dependencies of all elements in the underlying computations through reachability analysis with (matrix) polynomial zonotopes. We demonstrate our approach on three popular benchmark datasets.
- [1911] arXiv:2404.15070 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: BotDGT: Dynamicity-aware Social Bot Detection with Dynamic Graph TransformersBuyun He , Yingguang Yang , Qi Wu , Hao Liu , Renyu Yang , Hao Peng , Xiang Wang , Yong Liao , Pengyuan ZhouComments: IJCAI 2024Subjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI)
Abstract: Detecting social bots has evolved into a pivotal yet intricate task, aimed at combating the dissemination of misinformation and preserving the authenticity of online interactions. While earlier graph-based approaches, which leverage topological structure of social networks, yielded notable outcomes, they overlooked the inherent dynamicity of social networks -- In reality, they largely depicted the social network as a static graph and solely relied on its most recent state. Due to the absence of dynamicity modeling, such approaches are vulnerable to evasion, particularly when advanced social bots interact with other users to camouflage identities and escape detection. To tackle these challenges, we propose BotDGT, a novel framework that not only considers the topological structure, but also effectively incorporates dynamic nature of social network. Specifically, we characterize a social network as a dynamic graph. A structural module is employed to acquire topological information from each historical snapshot. Additionally, a temporal module is proposed to integrate historical context and model the evolving behavior patterns exhibited by social bots and legitimate users. Experimental results demonstrate the superiority of BotDGT against the leading methods that neglected the dynamic nature of social networks in terms of accuracy, recall, and F1-score.
- [1912] arXiv:2404.15121 (cross-list from cs.GR) [ pdf , ps , html , other ]
-
Title: Taming Diffusion Probabilistic Models for Character ControlComments: Accepted by SIGGRAPH 2024 (Conference Track). Project page and source codes: this https URLSubjects: Graphics (cs.GR) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: We present a novel character control framework that effectively utilizes motion diffusion probabilistic models to generate high-quality and diverse character animations, responding in real-time to a variety of dynamic user-supplied control signals. At the heart of our method lies a transformer-based Conditional Autoregressive Motion Diffusion Model (CAMDM), which takes as input the character's historical motion and can generate a range of diverse potential future motions conditioned on high-level, coarse user control. To meet the demands for diversity, controllability, and computational efficiency required by a real-time controller, we incorporate several key algorithmic designs. These include separate condition tokenization, classifier-free guidance on past motion, and heuristic future trajectory extension, all designed to address the challenges associated with taming motion diffusion probabilistic models for character control. As a result, our work represents the first model that enables real-time generation of high-quality, diverse character animations based on user interactive control, supporting animating the character in multiple styles with a single unified model. We evaluate our method on a diverse set of locomotion skills, demonstrating the merits of our method over existing character controllers. Project page and source codes: this https URL
- [1913] arXiv:2404.15141 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation MethodSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Transforming large pre-trained low-resolution diffusion models to cater to higher-resolution demands, i.e., diffusion extrapolation, significantly improves diffusion adaptability. We propose tuning-free CutDiffusion, aimed at simplifying and accelerating the diffusion extrapolation process, making it more affordable and improving performance. CutDiffusion abides by the existing patch-wise extrapolation but cuts a standard patch diffusion process into an initial phase focused on comprehensive structure denoising and a subsequent phase dedicated to specific detail refinement. Comprehensive experiments highlight the numerous almighty advantages of CutDiffusion: (1) simple method construction that enables a concise higher-resolution diffusion process without third-party engagement; (2) fast inference speed achieved through a single-step higher-resolution diffusion process, and fewer inference patches required; (3) cheap GPU cost resulting from patch-wise inference and fewer patches during the comprehensive structure denoising; (4) strong generation performance, stemming from the emphasis on specific detail refinement.
- [1914] arXiv:2404.15153 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Expert Router: Orchestrating Efficient Language Model Inference through Prompt ClassificationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Performance (cs.PF)
Abstract: Large Language Models (LLMs) have experienced widespread adoption across scientific and industrial domains due to their versatility and utility for diverse tasks. Nevertheless, deploying and serving these models at scale with optimal throughput and latency remains a significant challenge, primarily because of the high computational and memory demands associated with LLMs. To tackle this limitation, we introduce Expert Router, a system designed to orchestrate multiple expert models efficiently, thereby enhancing scalability. Expert Router is a parallel inference system with a central routing gateway that distributes incoming requests using a clustering method. This approach effectively partitions incoming requests among available LLMs, maximizing overall throughput. Our extensive evaluations encompassed up to 1,000 concurrent users, providing comprehensive insights into the system's behavior from user and infrastructure perspectives. The results demonstrate Expert Router's effectiveness in handling high-load scenarios and achieving higher throughput rates, particularly under many concurrent users.
- [1915] arXiv:2404.15154 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Do not think pink elephant!Comments: This paper is accepted in CVPRWSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Models (LMs) have heightened expectations for the potential of general AI as they are akin to human intelligence. This paper shows that recent large models such as Stable Diffusion and DALL-E3 also share the vulnerability of human intelligence, namely the "white bear phenomenon". We investigate the causes of the white bear phenomenon by analyzing their representation space. Based on this analysis, we propose a simple prompt-based attack method, which generates figures prohibited by the LM provider's policy. To counter these attacks, we introduce prompt-based defense strategies inspired by cognitive therapy techniques, successfully mitigating attacks by up to 48.22\%.
- [1916] arXiv:2404.15155 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Adaptive Collaboration Strategy for LLMs in Medical Decision MakingYubin Kim , Chanwoo Park , Hyewon Jeong , Yik Siu Chan , Xuhai Xu , Daniel McDuff , Cynthia Breazeal , Hae Won ParkSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Foundation models have become invaluable in advancing the medical field. Despite their promise, the strategic deployment of LLMs for effective utility in complex medical tasks remains an open question. Our novel framework, Medical Decision-making Agents (MDAgents) aims to address this gap by automatically assigning the effective collaboration structure for LLMs. Assigned solo or group collaboration structure is tailored to the complexity of the medical task at hand, emulating real-world medical decision making processes. We evaluate our framework and baseline methods with state-of-the-art LLMs across a suite of challenging medical benchmarks: MedQA, MedMCQA, PubMedQA, DDXPlus, PMC-VQA, Path-VQA, and MedVidQA, achieving the best performance in 5 out of 7 benchmarks that require an understanding of multi-modal medical reasoning. Ablation studies reveal that MDAgents excels in adapting the number of collaborating agents to optimize efficiency and accuracy, showcasing its robustness in diverse scenarios. We also explore the dynamics of group consensus, offering insights into how collaborative agents could behave in complex clinical team dynamics. Our code can be found at this https URL .
- [1917] arXiv:2404.15157 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: FASTTRACK: Fast and Accurate Fact Tracing for LLMsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Fact tracing seeks to identify specific training examples that serve as the knowledge source for a given query. Existing approaches to fact tracing rely on assessing the similarity between each training sample and the query along a certain dimension, such as lexical similarity, gradient, or embedding space. However, these methods fall short of effectively distinguishing between samples that are merely relevant and those that actually provide supportive evidence for the information sought by the query. This limitation often results in suboptimal effectiveness. Moreover, these approaches necessitate the examination of the similarity of individual training points for each query, imposing significant computational demands and creating a substantial barrier for practical applications. This paper introduces FASTTRACK, a novel approach that harnesses the capabilities of Large Language Models (LLMs) to validate supportive evidence for queries and at the same time clusters the training database towards a reduced extent for LLMs to trace facts. Our experiments show that FASTTRACK substantially outperforms existing methods in both accuracy and efficiency, achieving more than 100\% improvement in F1 score over the state-of-the-art methods while being X33 faster than \texttt{TracIn}.
- [1918] arXiv:2404.15159 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of ExpertsDengchun Li , Yingzi Ma , Naizheng Wang , Zhiyuan Cheng , Lei Duan , Jie Zuo , Cal Yang , Mingjie TangComments: 11 pages, 4 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have showcased exceptional performance across a wide array of Natural Language Processing (NLP) tasks. Fine-tuning techniques are commonly utilized to tailor pre-trained models to specific applications. While methods like LoRA have effectively tackled GPU memory constraints during fine-tuning, their applicability is often restricted to limited performance, especially on multi-task. On the other hand, Mix-of-Expert (MoE) models, such as Mixtral 8x7B, demonstrate remarkable performance across multiple NLP tasks while maintaining a reduced parameter count. However, the resource requirements of these MoEs still challenging, particularly for consumer-grade GPUs only have limited VRAM. To address these challenge, we propose MixLoRA, an innovative approach aimed at constructing a resource-efficient sparse MoE model based on LoRA. MixLoRA inserts multiple LoRA-based experts within the feed-forward network block of a frozen pre-trained dense model through fine-tuning, employing a commonly used top-k router. Unlike other LoRA based MoE methods, MixLoRA enhances model performance by utilizing independently configurable attention-layer LoRA adapters, supporting the use of LoRA and its variants for the construction of experts, and applying auxiliary load balance loss to address the imbalance problem of the router. In experiments, MixLoRA achieves commendable performance across all evaluation metrics in both single-task and multi-task learning scenarios. Implemented within the m-LoRA framework, MixLoRA enables parallel fine-tuning of multiple mixture-of-experts models on a single 24GB consumer-grade GPU without quantization, thereby reducing GPU memory consumption by 41\% and latency during the training process by 17\%.
- [1919] arXiv:2404.15166 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Pixels and Predictions: Potential of GPT-4V in Meteorological Imagery Analysis and Forecast CommunicationJohn R. Lawson , Montgomery L. Flora , Kevin H. Goebbert , Seth N. Lyman , Corey K. Potvin , David M. Schultz , Adam J. Stepanek , Joseph E. Trujillo-FalcónComments: Supplementary material PDF attached. Submitted to Artificial Intelligence for the Earth Systems (American Meteorological Society) on 18 April 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Abstract: Generative AI, such as OpenAI's GPT-4V large-language model, has rapidly entered mainstream discourse. Novel capabilities in image processing and natural-language communication may augment existing forecasting methods. Large language models further display potential to better communicate weather hazards in a style honed for diverse communities and different languages. This study evaluates GPT-4V's ability to interpret meteorological charts and communicate weather hazards appropriately to the user, despite challenges of hallucinations, where generative AI delivers coherent, confident, but incorrect responses. We assess GPT-4V's competence via its web interface ChatGPT in two tasks: (1) generating a severe-weather outlook from weather-chart analysis and conducting self-evaluation, revealing an outlook that corresponds well with a Storm Prediction Center human-issued forecast; and (2) producing hazard summaries in Spanish and English from weather charts. Responses in Spanish, however, resemble direct (not idiomatic) translations from English to Spanish, yielding poorly translated summaries that lose critical idiomatic precision required for optimal communication. Our findings advocate for cautious integration of tools like GPT-4V in meteorology, underscoring the necessity of human oversight and development of trustworthy, explainable AI.
- [1920] arXiv:2404.15182 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FLoRA: Enhancing Vision-Language Models with Parameter-Efficient Federated LearningComments: 10 pages, 11 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In the rapidly evolving field of artificial intelligence, multimodal models, e.g., integrating vision and language into visual-language models (VLMs), have become pivotal for many applications, ranging from image captioning to multimodal search engines. Among these models, the Contrastive Language-Image Pre-training (CLIP) model has demonstrated remarkable performance in understanding and generating nuanced relationships between text and images. However, the conventional training of such models often requires centralized aggregation of vast datasets, posing significant privacy and data governance challenges. To address these concerns, this paper proposes a novel approach that leverages Federated Learning and parameter-efficient adapters, i.e., Low-Rank Adaptation (LoRA), to train VLMs. This methodology preserves data privacy by training models across decentralized data sources and ensures model adaptability and efficiency through LoRA's parameter-efficient fine-tuning. Our approach accelerates training time by up to 34.72 times and requires 2.47 times less memory usage than full fine-tuning.
- [1921] arXiv:2404.15193 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Structurally Flexible Neural Networks: Evolving the Building Blocks for General AgentsSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Artificial neural networks used for reinforcement learning are structurally rigid, meaning that each optimized parameter of the network is tied to its specific placement in the network structure. It also means that a network only works with pre-defined and fixed input- and output sizes. This is a consequence of having the number of optimized parameters being directly dependent on the structure of the network. Structural rigidity limits the ability to optimize parameters of policies across multiple environments that do not share input and output spaces. Here, we evolve a set of neurons and plastic synapses each represented by a gated recurrent unit (GRU). During optimization, the parameters of these fundamental units of a neural network are optimized in different random structural configurations. Earlier work has shown that parameter sharing between units is important for making structurally flexible neurons We show that it is possible to optimize a set of distinct neuron- and synapse types allowing for a mitigation of the symmetry dilemma. We demonstrate this by optimizing a single set of neurons and synapses to solve multiple reinforcement learning control tasks simultaneously.
- [1922] arXiv:2404.15204 (cross-list from cs.PL) [ pdf , ps , html , other ]
-
Title: Towards a high-performance AI compiler with upstream MLIRRenato Golin , Lorenzo Chelini , Adam Siemieniuk , Kavitha Madhu , Niranjan Hasabnis , Hans Pabst , Evangelos Georganas , Alexander HeineckeComments: 13 pages, 8 figures, presented at CGO C4ML 2024 & MLIR Workshop EuroLLVM 2024Subjects: Programming Languages (cs.PL) ; Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Abstract: This work proposes a compilation flow using open-source compiler passes to build a framework to achieve ninja performance from a generic linear algebra high-level abstraction. We demonstrate this flow with a proof-of-concept MLIR project that uses input IR in Linalg-on-Tensor from TensorFlow and PyTorch, performs cache-level optimizations and lowering to micro-kernels for efficient vectorization, achieving over 90% of the performance of ninja-written equivalent programs. The contributions of this work include: (1) Packing primitives on the tensor dialect and passes for cache-aware distribution of tensors (single and multi-core) and type-aware instructions (VNNI, BFDOT, BFMMLA), including propagation of shapes across the entire function; (2) A linear algebra pipeline, including tile, fuse and bufferization strategies to get model-level IR into hardware friendly tile calls; (3) A mechanism for micro-kernel lowering to an open source library that supports various CPUs.
- [1923] arXiv:2404.15224 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Deep Models for Multi-View 3D Object Recognition: A ReviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Human decision-making often relies on visual information from multiple perspectives or views. In contrast, machine learning-based object recognition utilizes information from a single image of the object. However, the information conveyed by a single image may not be sufficient for accurate decision-making, particularly in complex recognition problems. The utilization of multi-view 3D representations for object recognition has thus far demonstrated the most promising results for achieving state-of-the-art performance. This review paper comprehensively covers recent progress in multi-view 3D object recognition methods for 3D classification and retrieval tasks. Specifically, we focus on deep learning-based and transformer-based techniques, as they are widely utilized and have achieved state-of-the-art performance. We provide detailed information about existing deep learning-based and transformer-based multi-view 3D object recognition models, including the most commonly used 3D datasets, camera configurations and number of views, view selection strategies, pre-trained CNN architectures, fusion strategies, and recognition performance on 3D classification and 3D retrieval tasks. Additionally, we examine various computer vision applications that use multi-view classification. Finally, we highlight key findings and future directions for developing multi-view 3D object recognition methods to provide readers with a comprehensive understanding of the field.
- [1924] arXiv:2404.15231 (cross-list from physics.optics) [ pdf , ps , html , other ]
-
Title: Direct Zernike Coefficient Prediction from Point Spread Functions and Extended Images using Deep LearningYong En Kok (1), Alexander Bentley (2), Andrew Parkes (1), Amanda J. Wright (2), Michael G. Somekh (2 and 3), Michael Pound (1) ((1) School of Computer Science, University of Nottingham, Nottingham, UK, (2) Optics and Photonics Group, Department of Electrical and Electronic Engineering, University of Nottingham, Nottingham, UK, (3) Research Center for Humanoid Sensing, Zhejiang Laboratory Hangzhou, China)Comments: 12 pages, 6 figures, 4 tablesSubjects: Optics (physics.optics) ; Artificial Intelligence (cs.AI)
Abstract: Optical imaging quality can be severely degraded by system and sample induced aberrations. Existing adaptive optics systems typically rely on iterative search algorithm to correct for aberrations and improve images. This study demonstrates the application of convolutional neural networks to characterise the optical aberration by directly predicting the Zernike coefficients from two to three phase-diverse optical images. We evaluated our network on 600,000 simulated Point Spread Function (PSF) datasets randomly generated within the range of -1 to 1 radians using the first 25 Zernike coefficients. The results show that using only three phase-diverse images captured above, below and at the focal plane with an amplitude of 1 achieves a low RMSE of 0.10 radians on the simulated PSF dataset. Furthermore, this approach directly predicts Zernike modes simulated extended 2D samples, while maintaining a comparable RMSE of 0.15 radians. We demonstrate that this approach is effective using only a single prediction step, or can be iterated a small number of times. This simple and straightforward technique provides rapid and accurate method for predicting the aberration correction using three or less phase-diverse images, paving the way for evaluation on real-world dataset.
- [1925] arXiv:2404.15238 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language TechnologiesWeiyan Shi , Ryan Li , Yutong Zhang , Caleb Ziems , Chunhua yu , Raya Horesh , Rogério Abreu de Paula , Diyi YangComments: 32 pages, 7 figures, preprintSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: To enhance language models' cultural awareness, we design a generalizable pipeline to construct cultural knowledge bases from different online communities on a massive scale. With the pipeline, we construct CultureBank, a knowledge base built upon users' self-narratives with 12K cultural descriptors sourced from TikTok and 11K from Reddit. Unlike previous cultural knowledge resources, CultureBank contains diverse views on cultural descriptors to allow flexible interpretation of cultural knowledge, and contextualized cultural scenarios to help grounded evaluation. With CultureBank, we evaluate different LLMs' cultural awareness, and identify areas for improvement. We also fine-tune a language model on CultureBank: experiments show that it achieves better performances on two downstream cultural tasks in a zero-shot setting. Finally, we offer recommendations based on our findings for future culturally aware language technologies. The project page is this https URL . The code and model is at this https URL . The released CultureBank dataset is at this https URL .
- [1926] arXiv:2404.15243 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: UCINet0: A Machine Learning based Receiver for 5G NR PUCCH Format 0Subjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract: Accurate decoding of Uplink Control Information (UCI) on the Physical Uplink Control Channel (PUCCH) is essential for enabling 5G wireless links. This paper explores an AI/ML-based receiver design for PUCCH Format 0. Format 0 signaling encodes the UCI content within the phase of a known base waveform and even supports multiplexing of up to 12 users within the same time-frequency resources. Our first-of-a-kind neural network classifier, which we term UCINet0, is capable of predicting when no user is transmitting on the PUCCH, as well as decoding the UCI content of any number of multiplexed users, up to 12. Inference results with both simulated and hardware-captured field datasets show that the UCINet0 model outperforms conventional DFT-based decoders across all SNR ranges.
- [1927] arXiv:2404.15247 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-ExpertsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Abstract: We introduce XFT, a simple yet powerful training scheme, by simply merging upcycled Mixture-of-Experts (MoE) to unleash the performance limit of instruction-tuned code Large Language Models (LLMs). While vanilla sparse upcycling fails to improve instruction tuning, XFT introduces a shared expert mechanism with a novel routing weight normalization strategy into sparse upcycling, which significantly boosts instruction tuning. After fine-tuning the upcycled MoE model, XFT introduces a learnable model merging mechanism to compile the upcycled MoE model back to a dense model, achieving upcycled MoE-level performance with only dense-model compute. By applying XFT to a 1.3B model, we create a new state-of-the-art tiny code LLM (<3B) with 67.1 and 64.6 pass@1 on HumanEval and HumanEval+ respectively. With the same data and model architecture, XFT improves supervised fine-tuning (SFT) by 13% on HumanEval+, along with consistent improvements from 2% to 13% on MBPP+, MultiPL-E, and DS-1000, demonstrating its generalizability. XFT is fully orthogonal to existing techniques such as Evol-Instruct and OSS-Instruct, opening a new dimension for improving code instruction tuning. Codes are available at this https URL .
- [1928] arXiv:2404.15256 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: TOP-Nav: Legged Navigation Integrating Terrain, Obstacle and Proprioception EstimationSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Abstract: Legged navigation is typically examined within open-world, off-road, and challenging environments. In these scenarios, estimating external disturbances requires a complex synthesis of multi-modal information. This underlines a major limitation in existing works that primarily focus on avoiding obstacles. In this work, we propose TOP-Nav, a novel legged navigation framework that integrates a comprehensive path planner with Terrain awareness, Obstacle avoidance and close-loop Proprioception. TOP-Nav underscores the synergies between vision and proprioception in both path and motion planning. Within the path planner, we present and integrate a terrain estimator that enables the robot to select waypoints on terrains with higher traversability while effectively avoiding obstacles. In the motion planning level, we not only implement a locomotion controller to track the navigation commands, but also construct a proprioception advisor to provide motion evaluations for the path planner. Based on the close-loop motion feedback, we make online corrections for the vision-based terrain and obstacle estimations. Consequently, TOP-Nav achieves open-world navigation that the robot can handle terrains or disturbances beyond the distribution of prior knowledge and overcomes constraints imposed by visual conditions. Building upon extensive experiments conducted in both simulation and real-world environments, TOP-Nav demonstrates superior performance in open-world navigation compared to existing methods.
- [1929] arXiv:2404.15269 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Aligning LLM Agents by Learning Latent Preference from User EditsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: We study interactive learning of language agents based on user edits made to the agent's output. In a typical setting such as writing assistants, the user interacts with a language agent to generate a response given a context, and may optionally edit the agent response to personalize it based on their latent preference, in addition to improving the correctness. The edit feedback is naturally generated, making it a suitable candidate for improving the agent's alignment with the user's preference, and for reducing the cost of user edits over time. We propose a learning framework, PRELUDE that infers a description of the user's latent preference based on historic edit data and using it to define a prompt policy that drives future response generation. This avoids fine-tuning the agent, which is costly, challenging to scale with the number of users, and may even degrade its performance on other tasks. Furthermore, learning descriptive preference improves interpretability, allowing the user to view and modify the learned preference. However, user preference can be complex and vary based on context, making it challenging to learn. To address this, we propose a simple yet effective algorithm named CIPHER that leverages a large language model (LLM) to infer the user preference for a given context based on user edits. In the future, CIPHER retrieves inferred preferences from the k-closest contexts in the history, and forms an aggregate preference for response generation. We introduce two interactive environments -- summarization and email writing, for evaluation using a GPT-4 simulated user. We compare with algorithms that directly retrieve user edits but do not learn descriptive preference, and algorithms that learn context-agnostic preference. On both tasks, CIPHER achieves the lowest edit distance cost and learns preferences that show significant similarity to the ground truth preferences
- [1930] arXiv:2404.15271 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Automatic Layout Planning for Visually-Rich Documents with Instruction-Following ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Recent advancements in instruction-following models have made user interactions with models more user-friendly and efficient, broadening their applicability. In graphic design, non-professional users often struggle to create visually appealing layouts due to limited skills and resources. In this work, we introduce a novel multimodal instruction-following framework for layout planning, allowing users to easily arrange visual elements into tailored layouts by specifying canvas size and design purpose, such as for book covers, posters, brochures, or menus. We developed three layout reasoning tasks to train the model in understanding and executing layout instructions. Experiments on two benchmarks show that our method not only simplifies the design process for non-professionals but also surpasses the performance of few-shot GPT-4V models, with mIoU higher by 12% on Crello. This progress highlights the potential of multimodal instruction-following models to automate and simplify the design process, providing an approachable solution for a wide range of design tasks on visually-rich documents.
- [1931] arXiv:2404.15272 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body ScenariosComments: 12 pages, 5 figures, 3 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Medical Vision-Language Pretraining (Med-VLP) establishes a connection between visual content from medical images and the relevant textual descriptions. Existing Med-VLP methods primarily focus on 2D images depicting a single body part, notably chest X-rays. In this paper, we extend the scope of Med-VLP to encompass 3D images, specifically targeting full-body scenarios, by using a multimodal dataset of CT images and reports. Compared with the 2D counterpart, 3D VLP is required to effectively capture essential semantics from significantly sparser representation in 3D imaging. In this paper, we introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning, aligning grounded visual features with precise diagnostic text. Additionally, we developed an abnormality dictionary to augment contrastive learning with diverse contrastive pairs. Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural languages. The performance of CT-GLIP is validated on a separate test set of 1,130 patients, focusing on the 16 most frequent abnormalities across 7 organs. The experimental results show our model's superior performance over the standard CLIP framework across zero-shot and fine-tuning scenarios, using both CNN and ViT architectures.
- [1932] arXiv:2404.15276 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose EstimationComments: Published at TPAMI 2024Journal-ref: https://www.computer.org/csdl/journal/tp/2024/05/10354384/1SP2qWh8Fq0Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Abstract: Existing Transformers for monocular 3D human shape and pose estimation typically have a quadratic computation and memory complexity with respect to the feature length, which hinders the exploitation of fine-grained information in high-resolution features that is beneficial for accurate reconstruction. In this work, we propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation, which allow effective utilization of high-resolution features in the Transformer. In addition, based on these two designs, we also introduce several novel modules including a multi-scale attention and a joint-aware attention to further boost the reconstruction performance. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods both quantitatively and qualitatively. Notably, the proposed algorithm achieves an MPJPE of 45.2 mm on the Human3.6M dataset, improving upon Mesh Graphormer by more than 10% with fewer than one-third of the parameters. Code and pretrained models are available at this https URL .
- [1933] arXiv:2404.15279 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Jointly Modeling Spatio-Temporal Features of Tactile Signals for Action ClassificationComments: Accepted by AAAI 2024Subjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI)
Abstract: Tactile signals collected by wearable electronics are essential in modeling and understanding human behavior. One of the main applications of tactile signals is action classification, especially in healthcare and robotics. However, existing tactile classification methods fail to capture the spatial and temporal features of tactile signals simultaneously, which results in sub-optimal performances. In this paper, we design Spatio-Temporal Aware tactility Transformer (STAT) to utilize continuous tactile signals for action classification. We propose spatial and temporal embeddings along with a new temporal pretraining task in our model, which aims to enhance the transformer in modeling the spatio-temporal features of tactile signals. Specially, the designed temporal pretraining task is to differentiate the time order of tubelet inputs to model the temporal properties explicitly. Experimental results on a public action classification dataset demonstrate that our model outperforms state-of-the-art methods in all metrics.
- [1934] arXiv:2404.15282 (cross-list from cs.DL) [ pdf , ps , html , other ]
-
Title: Patent Value Characterization -- An Empirical Analysis of Elevator Industry PatentsSubjects: Digital Libraries (cs.DL) ; Artificial Intelligence (cs.AI)
Abstract: The global patent application count has steadily increased, achieving eight consecutive years of growth.The global patent industry has shown a general trend of expansion. This is attributed to the increasing innovation activities, particularly in the fields of technology, healthcare, and biotechnology. Some emerging market countries, such as China and India, have experienced significant growth in the patent domain, becoming important participants in global patent activities.
- [1935] arXiv:2404.15284 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Global 4D Ionospheric STEC Prediction based on DeepONet for GNSS RaysSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI)
Abstract: The ionosphere is a vitally dynamic charged particle region in the Earth's upper atmosphere, playing a crucial role in applications such as radio communication and satellite navigation. The Slant Total Electron Contents (STEC) is an important parameter for characterizing wave propagation, representing the integrated electron density along the ray of radio signals passing through the ionosphere. The accurate prediction of STEC is essential for mitigating the ionospheric impact particularly on Global Navigation Satellite Systems (GNSS). In this work, we propose a high-precision STEC prediction model named DeepONet-STEC, which learns nonlinear operators to predict the 4D temporal-spatial integrated parameter for specified ground station - satellite ray path globally. As a demonstration, we validate the performance of the model based on GNSS observation data for global and US-CORS regimes under ionospheric quiet and storm conditions. The DeepONet-STEC model results show that the three-day 72 hour prediction in quiet periods could achieve high accuracy using observation data by the Precise Point Positioning (PPP) with temporal resolution 30s. Under active solar magnetic storm periods, the DeepONet-STEC also demonstrated its robustness and superiority than traditional deep learning methods. This work presents a neural operator regression architecture for predicting the 4D temporal-spatial ionospheric parameter for satellite navigation system performance, which may be further extended for various space applications and beyond.
- [1936] arXiv:2404.15307 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: DCAE-SR: Design of a Denoising Convolutional Autoencoder for reconstructing Electrocardiograms signals at Super ResolutionSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI)
Abstract: Electrocardiogram (ECG) signals play a pivotal role in cardiovascular diagnostics, providing essential information on the electrical activity of the heart. However, the inherent noise and limited resolution in ECG recordings can hinder accurate interpretation and diagnosis. In this paper, we propose a novel model for ECG super resolution (SR) that uses a DNAE to enhance temporal and frequency information inside ECG signals. Our approach addresses the limitations of traditional ECG signal processing techniques. Our model takes in input 5-second length ECG windows sampled at 50 Hz (very low resolution) and it is able to reconstruct a denoised super-resolution signal with an x10 upsampling rate (sampled at 500 Hz). We trained the proposed DCAE-SR on public available myocardial infraction ECG signals. Our method demonstrates superior performance in reconstructing high-resolution ECG signals from very low-resolution signals with a sampling rate of 50 Hz. We compared our results with the current deep-learning literature approaches for ECG super-resolution and some non-deep learning reproducible methods that can perform both super-resolution and denoising. We obtained current state-of-the-art performances in super-resolution of very low resolution ECG signals frequently corrupted by ECG artifacts. We were able to obtain a signal-to-noise ratio of 12.20 dB (outperforms previous 4.68 dB), mean squared error of 0.0044 (outperforms previous 0.0154) and root mean squared error of 4.86% (outperforms previous 12.40%). In conclusion, our DCAE-SR model offers a robust (to artefact presence), versatile and explainable solution to enhance the quality of ECG signals. This advancement holds promise in advancing the field of cardiovascular diagnostics, paving the way for improved patient care and high-quality clinical decisions
- [1937] arXiv:2404.15310 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Automated Assessment of Encouragement and Warmth in Classrooms Leveraging Multimodal Emotional Features and ChatGPTRuikun Hou , Tim Fütterer , Babette Bühler , Efe Bozkir , Peter Gerjets , Ulrich Trautwein , Enkelejda KasneciComments: Accepted as a full paper by the 25th International Conference on Artificial Intelligence in Education (AIED 2024)Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Classroom observation protocols standardize the assessment of teaching effectiveness and facilitate comprehension of classroom interactions. Whereas these protocols offer teachers specific feedback on their teaching practices, the manual coding by human raters is resource-intensive and often unreliable. This has sparked interest in developing AI-driven, cost-effective methods for automating such holistic coding. Our work explores a multimodal approach to automatically estimating encouragement and warmth in classrooms, a key component of the Global Teaching Insights (GTI) study's observation protocol. To this end, we employed facial and speech emotion recognition with sentiment analysis to extract interpretable features from video, audio, and transcript data. The prediction task involved both classification and regression methods. Additionally, in light of recent large language models' remarkable text annotation capabilities, we evaluated ChatGPT's zero-shot performance on this scoring task based on transcripts. We demonstrated our approach on the GTI dataset, comprising 367 16-minute video segments from 92 authentic lesson recordings. The inferences of GPT-4 and the best-trained model yielded correlations of r = .341 and r = .441 with human ratings, respectively. Combining estimates from both models through averaging, an ensemble approach achieved a correlation of r = .513, comparable to human inter-rater reliability. Our model explanation analysis indicated that text sentiment features were the primary contributors to the trained model's decisions. Moreover, GPT-4 could deliver logical and concrete reasoning as potential teacher guidelines. Our findings provide insights into using advanced, multimodal techniques for automated classroom observation, aiming to foster teacher training through frequent and valuable feedback.
- [1938] arXiv:2404.15311 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Fusing Pretrained ViTs with TCNet for Enhanced EEG RegressionComments: Accepted HCI International 2024Subjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The task of Electroencephalogram (EEG) analysis is paramount to the development of Brain-Computer Interfaces (BCIs). However, to reach the goal of developing robust, useful BCIs depends heavily on the speed and the accuracy at which BCIs can understand neural dynamics. In response to that goal, this paper details the integration of pre-trained Vision Transformers (ViTs) with Temporal Convolutional Networks (TCNet) to enhance the precision of EEG regression. The core of this approach lies in harnessing the sequential data processing strengths of ViTs along with the superior feature extraction capabilities of TCNet, to significantly improve EEG analysis accuracy. In addition, we analyze the importance of how to construct optimal patches for the attention mechanism to analyze, balancing both speed and accuracy tradeoffs. Our results showcase a substantial improvement in regression accuracy, as evidenced by the reduction of Root Mean Square Error (RMSE) from 55.4 to 51.8 on EEGEyeNet's Absolute Position Task, outperforming existing state-of-the-art models. Without sacrificing performance, we increase the speed of this model by an order of magnitude (up to 4.32x faster). This breakthrough not only sets a new benchmark in EEG regression analysis but also opens new avenues for future research in the integration of transformer architectures with specialized feature extraction methods for diverse EEG datasets.
- [1939] arXiv:2404.15319 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: The largest EEG-based BCI reproducibility study for open science: the MOABB benchmarkSylvain Chevallier , Igor Carrara , Bruno Aristimunha , Pierre Guetschel , Sara Sedlar , Bruna Lopes , Sebastien Velut , Salim Khazem , Thomas MoreauComments: 43 pages, 13 figures, 5 tablesSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Abstract: Objective. This study conduct an extensive Brain-computer interfaces (BCI) reproducibility analysis on open electroencephalography datasets, aiming to assess existing solutions and establish open and reproducible benchmarks for effective comparison within the field. The need for such benchmark lies in the rapid industrial progress that has given rise to undisclosed proprietary solutions. Furthermore, the scientific literature is dense, often featuring challenging-to-reproduce evaluations, making comparisons between existing approaches arduous.
Approach. Within an open framework, 30 machine learning pipelines (separated into raw signal: 11, Riemannian: 13, deep learning: 6) are meticulously re-implemented and evaluated across 36 publicly available datasets, including motor imagery (14), P300 (15), and SSVEP (7). The analysis incorporates statistical meta-analysis techniques for results assessment, encompassing execution time and environmental impact considerations.
Main results. The study yields principled and robust results applicable to various BCI paradigms, emphasizing motor imagery, P300, and SSVEP. Notably, Riemannian approaches utilizing spatial covariance matrices exhibit superior performance, underscoring the necessity for significant data volumes to achieve competitive outcomes with deep learning techniques. The comprehensive results are openly accessible, paving the way for future research to further enhance reproducibility in the BCI domain.
Significance. The significance of this study lies in its contribution to establishing a rigorous and transparent benchmark for BCI research, offering insights into optimal methodologies and highlighting the importance of reproducibility in driving advancements within the field. - [1940] arXiv:2404.15320 (cross-list from cs.DL) [ pdf , ps , html , other ]
-
Title: Using Large Language Models to Enrich the Documentation of Datasets for Machine LearningSubjects: Digital Libraries (cs.DL) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Recent regulatory initiatives like the European AI Act and relevant voices in the Machine Learning (ML) community stress the need to describe datasets along several key dimensions for trustworthy AI, such as the provenance processes and social concerns. However, this information is typically presented as unstructured text in accompanying documentation, hampering their automated analysis and processing. In this work, we explore using large language models (LLM) and a set of prompting strategies to automatically extract these dimensions from documents and enrich the dataset description with them. Our approach could aid data publishers and practitioners in creating machine-readable documentation to improve the discoverability of their datasets, assess their compliance with current AI regulations, and improve the overall quality of ML models trained on them.
In this paper, we evaluate the approach on 12 scientific dataset papers published in two scientific journals (Nature's Scientific Data and Elsevier's Data in Brief) using two different LLMs (GPT3.5 and Flan-UL2). Results show good accuracy with our prompt extraction strategies. Concrete results vary depending on the dimensions, but overall, GPT3.5 shows slightly better accuracy (81,21%) than FLAN-UL2 (69,13%) although it is more prone to hallucinations. We have released an open-source tool implementing our approach and a replication package, including the experiments' code and results, in an open-source repository. - [1941] arXiv:2404.15324 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Advanced simulation-based predictive modelling for solar irradiance sensor farmsJournal-ref: Journal of Simulation, pp. 1-18, 2024Subjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: As solar power continues to grow and replace traditional energy sources, the need for reliable forecasting models becomes increasingly important to ensure the stability and efficiency of the grid. However, the management of these models still needs to be improved, and new tools and technologies are required to handle the deployment and control of solar facilities. This work introduces a novel framework named Cloud-based Analysis and Integration for Data Efficiency (CAIDE), designed for real-time monitoring, management, and forecasting of solar irradiance sensor farms. CAIDE is designed to manage multiple sensor farms simultaneously while improving predictive models in real-time using well-grounded Modeling and Simulation (M&S) methodologies. The framework leverages Model Based Systems Engineering (MBSE) and an Internet of Things (IoT) infrastructure to support the deployment and analysis of solar plants in dynamic environments. The system can adapt and re-train the model when given incorrect results, ensuring that forecasts remain accurate and up-to-date. Furthermore, CAIDE can be executed in sequential, parallel, and distributed architectures, assuring scalability. The effectiveness of CAIDE is demonstrated in a complex scenario composed of several solar irradiance sensor farms connected to a centralized management system. Our results show that CAIDE is scalable and effective in managing and forecasting solar power production while improving the accuracy of predictive models in real time. The framework has important implications for the deployment of solar plants and the future of renewable energy sources.
- [1942] arXiv:2404.15347 (cross-list from eess.SP) [ pdf , ps , other ]
-
Title: Advanced Neural Network Architecture for Enhanced Multi-Lead ECG Arrhythmia Detection through Optimized Feature ExtractionSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Cardiovascular diseases are a pervasive global health concern, contributing significantly to morbidity and mortality rates worldwide. Among these conditions, arrhythmia, characterized by irregular heart rhythms, presents formidable diagnostic challenges. This study introduces an innovative approach utilizing deep learning techniques, specifically Convolutional Neural Networks (CNNs), to address the complexities of arrhythmia classification. Leveraging multi-lead Electrocardiogram (ECG) data, our CNN model, comprising six layers with a residual block, demonstrates promising outcomes in identifying five distinct heartbeat types: Left Bundle Branch Block (LBBB), Right Bundle Branch Block (RBBB), Atrial Premature Contraction (APC), Premature Ventricular Contraction (PVC), and Normal Beat. Through rigorous experimentation, we highlight the transformative potential of our methodology in enhancing diagnostic accuracy for cardiovascular arrhythmias. Arrhythmia diagnosis remains a critical challenge in cardiovascular care, often relying on manual interpretation of ECG signals, which can be time-consuming and prone to subjectivity. To address these limitations, we propose a novel approach that leverages deep learning algorithms to automate arrhythmia classification. By employing advanced CNN architectures and multi-lead ECG data, our methodology offers a robust solution for precise and efficient arrhythmia detection. Through comprehensive evaluation, we demonstrate the effectiveness of our approach in facilitating more accurate clinical decision-making, thereby improving patient outcomes in managing cardiovascular arrhythmias.
- [1943] arXiv:2404.15353 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: SQUWA: Signal Quality Aware DNN Architecture for Enhanced Accuracy in Atrial Fibrillation Detection from Noisy PPG SignalsComments: 15 pages; 9 figures; 2024 Conference on Health, Inference, and Learning (CHIL)Subjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Atrial fibrillation (AF), a common cardiac arrhythmia, significantly increases the risk of stroke, heart disease, and mortality. Photoplethysmography (PPG) offers a promising solution for continuous AF monitoring, due to its cost efficiency and integration into wearable devices. Nonetheless, PPG signals are susceptible to corruption from motion artifacts and other factors often encountered in ambulatory settings. Conventional approaches typically discard corrupted segments or attempt to reconstruct original signals, allowing for the use of standard machine learning techniques. However, this reduces dataset size and introduces biases, compromising prediction accuracy and the effectiveness of continuous monitoring. We propose a novel deep learning model, Signal Quality Weighted Fusion of Attentional Convolution and Recurrent Neural Network (SQUWA), designed to learn how to retain accurate predictions from partially corrupted PPG. Specifically, SQUWA innovatively integrates an attention mechanism that directly considers signal quality during the learning process, dynamically adjusting the weights of time series segments based on their quality. This approach enhances the influence of higher-quality segments while reducing that of lower-quality ones, effectively utilizing partially corrupted segments. This approach represents a departure from the conventional methods that exclude such segments, enabling the utilization of a broader range of data, which has great implications for less disruption when monitoring of AF risks and more accurate estimation of AF burdens. Our extensive experiments show that SQUWA outperform existing PPG-based models, achieving the highest AUCPR of 0.89 with label noise mitigation. This also exceeds the 0.86 AUCPR of models trained with using both electrocardiogram (ECG) and PPG data.
- [1944] arXiv:2404.15354 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Elevating Spectral GNNs through Enhanced Band-pass Filter ApproximationComments: PreprintSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Abstract: Spectral Graph Neural Networks (GNNs) have attracted great attention due to their capacity to capture patterns in the frequency domains with essential graph filters. Polynomial-based ones (namely poly-GNNs), which approximately construct graph filters with conventional or rational polynomials, are routinely adopted in practice for their substantial performances on graph learning tasks. However, previous poly-GNNs aim at achieving overall lower approximation error on different types of filters, e.g., low-pass and high-pass, but ignore a key question: \textit{which type of filter warrants greater attention for poly-GNNs?} In this paper, we first show that poly-GNN with a better approximation for band-pass graph filters performs better on graph learning tasks. This insight further sheds light on critical issues of existing poly-GNNs, i.e., those poly-GNNs achieve trivial performance in approximating band-pass graph filters, hindering the great potential of poly-GNNs. To tackle the issues, we propose a novel poly-GNN named TrigoNet. TrigoNet constructs different graph filters with novel trigonometric polynomial, and achieves leading performance in approximating band-pass graph filters against other polynomials. By applying Taylor expansion and deserting nonlinearity, TrigoNet achieves noticeable efficiency among baselines. Extensive experiments show the advantages of TrigoNet in both accuracy performances and efficiency.
- [1945] arXiv:2404.15360 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Towards Robust and Interpretable EMG-based Hand Gesture Recognition using Deep Metric Meta LearningSimon Tam , Shriram Tallam Puranam Raghu , Étienne Buteau , Erik Scheme , Mounir Boukadoum , Alexandre Campeau-Lecours , Benoit GosselinComments: 11 pages, 9 figures, submitted to IEEE Transactions on Neural Networks and Learning SystemsSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract: Current electromyography (EMG) pattern recognition (PR) models have been shown to generalize poorly in unconstrained environments, setting back their adoption in applications such as hand gesture control. This problem is often due to limited training data, exacerbated by the use of supervised classification frameworks that are known to be suboptimal in such settings. In this work, we propose a shift to deep metric-based meta-learning in EMG PR to supervise the creation of meaningful and interpretable representations. We use a Siamese Deep Convolutional Neural Network (SDCNN) and contrastive triplet loss to learn an EMG feature embedding space that captures the distribution of the different classes. A nearest-centroid approach is subsequently employed for inference, relying on how closely a test sample aligns with the established data distributions. We derive a robust class proximity-based confidence estimator that leads to a better rejection of incorrect decisions, i.e. false positives, especially when operating beyond the training data domain. We show our approach's efficacy by testing the trained SDCNN's predictions and confidence estimations on unseen data, both in and out of the training domain. The evaluation metrics include the accuracy-rejection curve and the Kullback-Leibler divergence between the confidence distributions of accurate and inaccurate predictions. Outperforming comparable models on both metrics, our results demonstrate that the proposed meta-learning approach improves the classifier's precision in active decisions (after rejection), thus leading to better generalization and applicability.
- [1946] arXiv:2404.15364 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: MP-DPD: Low-Complexity Mixed-Precision Neural Networks for Energy-Efficient Digital Predistortion of Wideband Power AmplifiersYizhuo Wu , Ang Li , Mohammadreza Beikmirza , Gagan Deep Singh , Qinyu Chen , Leo C. N. de Vreede , Morteza Alavi , Chang GaoComments: Accepted to IEEE Microwave and Wireless Technology Letters (MWTL)Subjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Digital Pre-Distortion (DPD) enhances signal quality in wideband RF power amplifiers (PAs). As signal bandwidths expand in modern radio systems, DPD's energy consumption increasingly impacts overall system efficiency. Deep Neural Networks (DNNs) offer promising advancements in DPD, yet their high complexity hinders their practical deployment. This paper introduces open-source mixed-precision (MP) neural networks that employ quantized low-precision fixed-point parameters for energy-efficient DPD. This approach reduces computational complexity and memory footprint, thereby lowering power consumption without compromising linearization efficacy. Applied to a 160MHz-BW 1024-QAM OFDM signal from a digital RF PA, MP-DPD gives no performance loss against 32-bit floating-point precision DPDs, while achieving -43.75 (L)/-45.27 (R) dBc in Adjacent Channel Power Ratio (ACPR) and -38.72 dB in Error Vector Magnitude (EVM). A 16-bit fixed-point-precision MP-DPD enables a 2.8X reduction in estimated inference power. The PyTorch learning and testing code is publicly available at \url{ this https URL }.
- [1947] arXiv:2404.15369 (cross-list from q-bio.NC) [ pdf , ps , html , other ]
-
Title: Can a Machine be Conscious? Towards Universal Criteria for Machine ConsciousnessComments: This work was supported by the UKRI CDT in AI for Healthcare, this http URL (Grant No. EP/S023283/1)Subjects: Neurons and Cognition (q-bio.NC) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: As artificially intelligent systems become more anthropomorphic and pervasive, and their potential impact on humanity more urgent, discussions about the possibility of machine consciousness have significantly intensified, and it is sometimes seen as 'the holy grail'. Many concerns have been voiced about the ramifications of creating an artificial conscious entity. This is compounded by a marked lack of consensus around what constitutes consciousness and by an absence of a universal set of criteria for determining consciousness. By going into depth on the foundations and characteristics of consciousness, we propose five criteria for determining whether a machine is conscious, which can also be applied more generally to any entity. This paper aims to serve as a primer and stepping stone for researchers of consciousness, be they in philosophy, computer science, medicine, or any other field, to further pursue this holy grail of philosophy, neuroscience and artificial intelligence.
- [1948] arXiv:2404.15370 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Self-Supervised Learning for User LocalizationSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Abstract: Machine learning techniques have shown remarkable accuracy in localization tasks, but their dependency on vast amounts of labeled data, particularly Channel State Information (CSI) and corresponding coordinates, remains a bottleneck. Self-supervised learning techniques alleviate the need for labeled data, a potential that remains largely untapped and underexplored in existing research. Addressing this gap, we propose a pioneering approach that leverages self-supervised pretraining on unlabeled data to boost the performance of supervised learning for user localization based on CSI. We introduce two pretraining Auto Encoder (AE) models employing Multi Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) to glean representations from unlabeled data via self-supervised learning. Following this, we utilize the encoder portion of the AE models to extract relevant features from labeled data, and finetune an MLP-based Position Estimation Model to accurately deduce user locations. Our experimentation on the CTW-2020 dataset, which features a substantial volume of unlabeled data but limited labeled samples, demonstrates the viability of our approach. Notably, the dataset covers a vast area spanning over 646x943x41 meters, and our approach demonstrates promising results even for such expansive localization tasks.
- [1949] arXiv:2404.15371 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Efficient Verification of a RADAR SoC Using Formal and Simulation-Based MethodsComments: Published in DVCon Europe 2023Subjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI)
Abstract: As the demand for Internet of Things (IoT) and Human-to-Machine Interaction (HMI) increases, modern System-on-Chips (SoCs) offering such solutions are becoming increasingly complex. This intricate design poses significant challenges for verification, particularly when time-to-market is a crucial factor for consumer electronics products. This paper presents a case study based on our work to verify a complex Radio Detection And Ranging (RADAR) based SoC that performs on-chip sensing of human motion with millimetre accuracy. We leverage both formal and simulation-based methods to complement each other and achieve verification sign-off with high confidence. While employing a requirements-driven flow approach, we demonstrate the use of different verification methods to cater to multiple requirements and highlight our know-how from the project. Additionally, we used Machine Learning (ML) based methods, specifically the Xcelium ML tool from Cadence, to improve verification throughput.
- [1950] arXiv:2404.15373 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Robust EEG-based Emotion Recognition Using an Inception and Two-sided Perturbation ModelSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Automated emotion recognition using electroencephalogram (EEG) signals has gained substantial attention. Although deep learning approaches exhibit strong performance, they often suffer from vulnerabilities to various perturbations, like environmental noise and adversarial attacks. In this paper, we propose an Inception feature generator and two-sided perturbation (INC-TSP) approach to enhance emotion recognition in brain-computer interfaces. INC-TSP integrates the Inception module for EEG data analysis and employs two-sided perturbation (TSP) as a defensive mechanism against input perturbations. TSP introduces worst-case perturbations to the model's weights and inputs, reinforcing the model's elasticity against adversarial attacks. The proposed approach addresses the challenge of maintaining accurate emotion recognition in the presence of input uncertainties. We validate INC-TSP in a subject-independent three-class emotion recognition scenario, demonstrating robust performance.
- [1951] arXiv:2404.15378 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Hierarchical Hybrid Sliced Wasserstein: A Scalable Metric for Heterogeneous Joint DistributionsComments: 28 pages, 11 figures, 4 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract: Sliced Wasserstein (SW) and Generalized Sliced Wasserstein (GSW) have been widely used in applications due to their computational and statistical scalability. However, the SW and the GSW are only defined between distributions supported on a homogeneous domain. This limitation prevents their usage in applications with heterogeneous joint distributions with marginal distributions supported on multiple different domains. Using SW and GSW directly on the joint domains cannot make a meaningful comparison since their homogeneous slicing operator i.e., Radon Transform (RT) and Generalized Radon Transform (GRT) are not expressive enough to capture the structure of the joint supports set. To address the issue, we propose two new slicing operators i.e., Partial Generalized Radon Transform (PGRT) and Hierarchical Hybrid Radon Transform (HHRT). In greater detail, PGRT is the generalization of Partial Radon Transform (PRT), which transforms a subset of function arguments non-linearly while HHRT is the composition of PRT and multiple domain-specific PGRT on marginal domain arguments. By using HHRT, we extend the SW into Hierarchical Hybrid Sliced Wasserstein (H2SW) distance which is designed specifically for comparing heterogeneous joint distributions. We then discuss the topological, statistical, and computational properties of H2SW. Finally, we demonstrate the favorable performance of H2SW in 3D mesh deformation, deep 3D mesh autoencoders, and datasets comparison.
- [1952] arXiv:2404.15379 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Clustering of timed sequences -- Application to the analysis of care pathwaysSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Improving the future of healthcare starts by better understanding the current actual practices in hospitals. This motivates the objective of discovering typical care pathways from patient data. Revealing homogeneous groups of care pathways can be achieved through clustering. The difficulty in clustering care pathways, represented by sequences of timestamped events, lies in defining a semantically appropriate metric and clustering algorithms.
In this article, we adapt two methods developed for time series to time sequences: the drop-DTW metric and the DBA approach for the construction of averaged time sequences. These methods are then applied in clustering algorithms to propose original and sound clustering algorithms for timed sequences.
This approach is experimented with and evaluated on synthetic and real use cases. - [1953] arXiv:2404.15380 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: ControlTraj: Controllable Trajectory Generation with Topology-Constrained Diffusion ModelYuanshao Zhu , James Jianqiao Yu , Xiangyu Zhao , Qidong Liu , Yongchao Ye , Wei Chen , Zijian Zhang , Xuetao Wei , Yuxuan LiangSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Generating trajectory data is among promising solutions to addressing privacy concerns, collection costs, and proprietary restrictions usually associated with human mobility analyses. However, existing trajectory generation methods are still in their infancy due to the inherent diversity and unpredictability of human activities, grappling with issues such as fidelity, flexibility, and generalizability. To overcome these obstacles, we propose ControlTraj, a Controllable Trajectory generation framework with the topology-constrained diffusion model. Distinct from prior approaches, ControlTraj utilizes a diffusion model to generate high-fidelity trajectories while integrating the structural constraints of road network topology to guide the geographical outcomes. Specifically, we develop a novel road segment autoencoder to extract fine-grained road segment embedding. The encoded features, along with trip attributes, are subsequently merged into the proposed geographic denoising UNet architecture, named GeoUNet, to synthesize geographic trajectories from white noise. Through experimentation across three real-world data settings, ControlTraj demonstrates its ability to produce human-directed, high-fidelity trajectory generation with adaptability to unexplored geographical contexts.
- [1954] arXiv:2404.15381 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Advances and Open Challenges in Federated Learning with Foundation ModelsChao Ren , Han Yu , Hongyi Peng , Xiaoli Tang , Anran Li , Yulan Gao , Alysa Ziying Tan , Bo Zhao , Xiaoxiao Li , Zengxiang Li , Qiang YangComments: Survey of Federated Foundation Models (FedFM)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The integration of Foundation Models (FMs) with Federated Learning (FL) presents a transformative paradigm in Artificial Intelligence (AI), offering enhanced capabilities while addressing concerns of privacy, data decentralization, and computational efficiency. This paper provides a comprehensive survey of the emerging field of Federated Foundation Models (FedFM), elucidating their synergistic relationship and exploring novel methodologies, challenges, and future directions that the FL research field needs to focus on in order to thrive in the age of foundation models. A systematic multi-tiered taxonomy is proposed, categorizing existing FedFM approaches for model training, aggregation, trustworthiness, and incentivization. Key challenges, including how to enable FL to deal with high complexity of computational demands, privacy considerations, contribution evaluation, and communication efficiency, are thoroughly discussed. Moreover, the paper explores the intricate challenges of communication, scalability and security inherent in training/fine-tuning FMs via FL, highlighting the potential of quantum computing to revolutionize the training, inference, optimization and data encryption processes. This survey underscores the importance of further research to propel innovation in FedFM, emphasizing the need for developing trustworthy solutions. It serves as a foundational guide for researchers and practitioners interested in contributing to this interdisciplinary and rapidly advancing field.
- [1955] arXiv:2404.15382 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Feature Distribution Shift Mitigation with Contrastive Pretraining for Intrusion DetectionWeixing Wang , Haojin Yang , Christoph Meinel , Hasan Yagiz Özkan , Cristian Bermudez Serna , Carmen Mas-MachucaComments: accepted by ICMLCN24Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Abstract: In recent years, there has been a growing interest in using Machine Learning (ML), especially Deep Learning (DL) to solve Network Intrusion Detection (NID) problems. However, the feature distribution shift problem remains a difficulty, because the change in features' distributions over time negatively impacts the model's performance. As one promising solution, model pretraining has emerged as a novel training paradigm, which brings robustness against feature distribution shift and has proven to be successful in Computer Vision (CV) and Natural Language Processing (NLP). To verify whether this paradigm is beneficial for NID problem, we propose SwapCon, a ML model in the context of NID, which compresses shift-invariant feature information during the pretraining stage and refines during the finetuning stage. We exemplify the evidence of feature distribution shift using the Kyoto2006+ dataset. We demonstrate how pretraining a model with the proper size can increase robustness against feature distribution shifts by over 8%. Moreover, we show how an adequate numerical embedding strategy also enhances the performance of pretrained models. Further experiments show that the proposed SwapCon model also outperforms eXtreme Gradient Boosting (XGBoost) and K-Nearest Neighbor (KNN) based models by a large margin.
- [1956] arXiv:2404.15383 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: WANDR: Intention-guided Human Motion GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Synthesizing natural human motions that enable a 3D human avatar to walk and reach for arbitrary goals in 3D space remains an unsolved problem with many applications. Existing methods (data-driven or using reinforcement learning) are limited in terms of generalization and motion naturalness. A primary obstacle is the scarcity of training data that combines locomotion with goal reaching. To address this, we introduce WANDR, a data-driven model that takes an avatar's initial pose and a goal's 3D position and generates natural human motions that place the end effector (wrist) on the goal location. To solve this, we introduce novel intention features that drive rich goal-oriented movement. Intention guides the agent to the goal, and interactively adapts the generation to novel situations without needing to define sub-goals or the entire motion path. Crucially, intention allows training on datasets that have goal-oriented motions as well as those that do not. WANDR is a conditional Variational Auto-Encoder (c-VAE), which we train using the AMASS and CIRCLE datasets. We evaluate our method extensively and demonstrate its ability to generate natural and long-term motions that reach 3D goals and generalize to unseen goal locations. Our models and code are available for research purposes at this http URL .
- [1957] arXiv:2404.15384 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FL-TAC: Enhanced Fine-Tuning in Federated Learning via Low-Rank, Task-Specific Adapter ClusteringSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Although large-scale pre-trained models hold great potential for adapting to downstream tasks through fine-tuning, the performance of such fine-tuned models is often limited by the difficulty of collecting sufficient high-quality, task-specific data. Federated Learning (FL) offers a promising solution by enabling fine-tuning across large-scale clients with a variety of task data, but it is bottlenecked by significant communication overhead due to the pre-trained models' extensive size. This paper addresses the high communication cost for fine-tuning large pre-trained models within FL frameworks through low-rank fine-tuning. Specifically, we train a low-rank adapter for each individual task on the client side, followed by server-side clustering for similar group of adapters to achieve task-specific aggregation. Extensive experiments on various language and vision tasks, such as GLUE and CIFAR-10/100, reveal the evolution of task-specific adapters throughout the FL training process and verify the effectiveness of the proposed low-rank task-specific adapter clustering (TAC) method.
- [1958] arXiv:2404.15385 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Sum of Group Error Differences: A Critical Examination of Bias Evaluation in Biometric Verification and a Dual-Metric MeasureAlaa Elobaid , Nathan Ramoly , Lara Younes , Symeon Papadopoulos , Eirini Ntoutsi , Ioannis KompatsiarisSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Biometric Verification (BV) systems often exhibit accuracy disparities across different demographic groups, leading to biases in BV applications. Assessing and quantifying these biases is essential for ensuring the fairness of BV systems. However, existing bias evaluation metrics in BV have limitations, such as focusing exclusively on match or non-match error rates, overlooking bias on demographic groups with performance levels falling between the best and worst performance levels, and neglecting the magnitude of the bias present.
This paper presents an in-depth analysis of the limitations of current bias evaluation metrics in BV and, through experimental analysis, demonstrates their contextual suitability, merits, and limitations. Additionally, it introduces a novel general-purpose bias evaluation measure for BV, the ``Sum of Group Error Differences (SEDG)''. Our experimental results on controlled synthetic datasets demonstrate the effectiveness of demographic bias quantification when using existing metrics and our own proposed measure. We discuss the applicability of the bias evaluation metrics in a set of simulated demographic bias scenarios and provide scenario-based metric recommendations. Our code is publicly available under \url{ this https URL }. - [1959] arXiv:2404.15386 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Large-Scale Multipurpose Benchmark Datasets For Assessing Data-Driven Deep Learning Approaches For Water Distribution NetworksComments: Presented at WDSA CCWI, Ferrara, Italy, July 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Currently, the number of common benchmark datasets that researchers can use straight away for assessing data-driven deep learning approaches is very limited. Most studies provide data as configuration files. It is still up to each practitioner to follow a particular data generation method and run computationally intensive simulations to obtain usable data for model training and evaluation. In this work, we provide a collection of datasets that includes several small and medium size publicly available Water Distribution Networks (WDNs), including Anytown, Modena, Balerma, C-Town, D-Town, L-Town, Ky1, Ky6, Ky8, and Ky13. In total 1,394,400 hours of WDNs data operating under normal conditions is made available to the community.
- [1960] arXiv:2404.15388 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: ML-based identification of the interface regions for coupling local and nonlocal modelsComments: 23 pages, 14 figures, research paperSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Local-nonlocal coupling approaches combine the computational efficiency of local models and the accuracy of nonlocal models. However, the coupling process is challenging, requiring expertise to identify the interface between local and nonlocal regions. This study introduces a machine learning-based approach to automatically detect the regions in which the local and nonlocal models should be used in a coupling approach. This identification process uses the loading functions and provides as output the selected model at the grid points. Training is based on datasets of loading functions for which reference coupling configurations are computed using accurate coupled solutions, where accuracy is measured in terms of the relative error between the solution to the coupling approach and the solution to the nonlocal model. We study two approaches that differ from one another in terms of the data structure. The first approach, referred to as the full-domain input data approach, inputs the full load vector and outputs a full label vector. In this case, the classification process is carried out globally. The second approach consists of a window-based approach, where loads are preprocessed and partitioned into windows and the problem is formulated as a node-wise classification approach in which the central point of each window is treated individually. The classification problems are solved via deep learning algorithms based on convolutional neural networks. The performance of these approaches is studied on one-dimensional numerical examples using F1-scores and accuracy metrics. In particular, it is shown that the windowing approach provides promising results, achieving an accuracy of 0.96 and an F1-score of 0.97. These results underscore the potential of the approach to automate coupling processes, leading to more accurate and computationally efficient solutions for material science applications.
- [1961] arXiv:2404.15390 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Uncertainty in latent representations of variational autoencoders optimized for visual tasksSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Deep learning methods are increasingly becoming instrumental as modeling tools in computational neuroscience, employing optimality principles to build bridges between neural responses and perception or behavior. Developing models that adequately represent uncertainty is however challenging for deep learning methods, which often suffer from calibration problems. This constitutes a difficulty in particular when modeling cortical circuits in terms of Bayesian inference, beyond single point estimates such as the posterior mean or the maximum a posteriori. In this work we systematically studied uncertainty representations in latent representations of variational auto-encoders (VAEs), both in a perceptual task from natural images and in two other canonical tasks of computer vision, finding a poor alignment between uncertainty and informativeness or ambiguities in the images. We next showed how a novel approach which we call explaining-away variational auto-encoders (EA-VAEs), fixes these issues, producing meaningful reports of uncertainty in a variety of scenarios, including interpolation, image corruption, and even out-of-distribution detection. We show EA-VAEs may prove useful both as models of perception in computational neuroscience and as inference tools in computer vision.
- [1962] arXiv:2404.15392 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Na\"ive Bayes and Random Forest for Crop Yield PredictionAbbas Maazallahi , Sreehari Thota , Naga Prasad Kondaboina , Vineetha Muktineni , Deepthi Annem , Abhi Stephen Rokkam , Mohammad Hossein Amini , Mohammad Amir Salari , Payam Norouzzadeh , Eli Snir , Bahareh RahmaniSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This study analyzes crop yield prediction in India from 1997 to 2020, focusing on various crops and key environmental factors. It aims to predict agricultural yields by utilizing advanced machine learning techniques like Linear Regression, Decision Tree, KNN, Naïve Bayes, K-Mean Clustering, and Random Forest. The models, particularly Naïve Bayes and Random Forest, demonstrate high effectiveness, as shown through data visualizations. The research concludes that integrating these analytical methods significantly enhances the accuracy and reliability of crop yield predictions, offering vital contributions to agricultural data science.
- [1963] arXiv:2404.15406 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMsDavide Caffagni , Federico Cocchi , Nicholas Moratelli , Sara Sarto , Marcella Cornia , Lorenzo Baraldi , Rita CucchiaraComments: CVPR 2024 Workshop on What is Next in Multimodal Foundation ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Abstract: Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Relevant passages, using this approach, are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
- [1964] arXiv:2404.15410 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Planning the path with Reinforcement Learning: Optimal Robot Motion Planning in RoboCup Small Size League EnvironmentsMateus G. Machado , João G. Melo , Cleber Zanchettin , Pedro H. M. Braga , Pedro V. Cunha , Edna N. S. Barros , Hansenclever F. BassaniComments: 12 pages, 3 figures, 3 tablesSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This work investigates the potential of Reinforcement Learning (RL) to tackle robot motion planning challenges in the dynamic RoboCup Small Size League (SSL). Using a heuristic control approach, we evaluate RL's effectiveness in obstacle-free and single-obstacle path-planning environments. Ablation studies reveal significant performance improvements. Our method achieved a 60% time gain in obstacle-free environments compared to baseline algorithms. Additionally, our findings demonstrated dynamic obstacle avoidance capabilities, adeptly navigating around moving blocks. These findings highlight the potential of RL to enhance robot motion planning in the challenging and unpredictable SSL environment.
- [1965] arXiv:2404.15417 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: The Power of Resets in Online Reinforcement LearningComments: Fixed a small typoSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Simulators are a pervasive tool in reinforcement learning, but most existing algorithms cannot efficiently exploit simulator access -- particularly in high-dimensional domains that require general function approximation. We explore the power of simulators through online reinforcement learning with {local simulator access} (or, local planning), an RL protocol where the agent is allowed to reset to previously observed states and follow their dynamics during training. We use local simulator access to unlock new statistical guarantees that were previously out of reach:
- We show that MDPs with low coverability (Xie et al. 2023) -- a general structural condition that subsumes Block MDPs and Low-Rank MDPs -- can be learned in a sample-efficient fashion with only $Q^{\star}$-realizability (realizability of the optimal state-value function); existing online RL algorithms require significantly stronger representation conditions.
- As a consequence, we show that the notorious Exogenous Block MDP problem (Efroni et al. 2022) is tractable under local simulator access.
The results above are achieved through a computationally inefficient algorithm. We complement them with a more computationally efficient algorithm, RVFS (Recursive Value Function Search), which achieves provable sample complexity guarantees under a strengthened statistical assumption known as pushforward coverability. RVFS can be viewed as a principled, provable counterpart to a successful empirical paradigm that combines recursive search (e.g., MCTS) with value function approximation. - [1966] arXiv:2404.15418 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Machine Learning Techniques with Fairness for Prediction of Completion of Drug and Alcohol RehabilitationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: The aim of this study is to look at predicting whether a person will complete a drug and alcohol rehabilitation program and the number of times a person attends. The study is based on demographic data obtained from Substance Abuse and Mental Health Services Administration (SAMHSA) from both admissions and discharge data from drug and alcohol rehabilitation centers in Oklahoma. Demographic data is highly categorical which led to binary encoding being used and various fairness measures being utilized to mitigate bias of nine demographic variables. Kernel methods such as linear, polynomial, sigmoid, and radial basis functions were compared using support vector machines at various parameter ranges to find the optimal values. These were then compared to methods such as decision trees, random forests, and neural networks. Synthetic Minority Oversampling Technique Nominal (SMOTEN) for categorical data was used to balance the data with imputation for missing data. The nine bias variables were then intersectionalized to mitigate bias and the dual and triple interactions were integrated to use the probabilities to look at worst case ratio fairness mitigation. Disparate Impact, Statistical Parity difference, Conditional Statistical Parity Ratio, Demographic Parity, Demographic Parity Ratio, Equalized Odds, Equalized Odds Ratio, Equal Opportunity, and Equalized Opportunity Ratio were all explored at both the binary and multiclass scenarios.
- [1967] arXiv:2404.15420 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: XC-Cache: Cross-Attending to Cached Context for Efficient LLM InferenceJoão Monteiro , Étienne Marcotte , Pierre-André Noël , Valentina Zantedeschi , David Vázquez , Nicolas Chapados , Christopher Pal , Perouz TaslakianSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of self-attention operations, and caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters. When the right context isn't known in advance, caching ICL can be challenging. This work addresses these limitations by introducing models that, inspired by the encoder-decoder architecture, use cross-attention to condition generation on reference text without the prompt. More precisely, we leverage pre-trained decoder-only models and only train a small number of added layers. We use Question-Answering (QA) as a testbed to evaluate the ability of our models to perform conditional generation and observe that they outperform ICL, are comparable to fine-tuned prompted LLMs, and drastically reduce the space footprint relative to standard KV caching by two orders of magnitude.
- [1968] arXiv:2404.15447 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: GLoD: Composing Global Contexts and Local Details in Image GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Diffusion models have demonstrated their capability to synthesize high-quality and diverse images from textual prompts. However, simultaneous control over both global contexts (e.g., object layouts and interactions) and local details (e.g., colors and emotions) still remains a significant challenge. The models often fail to understand complex descriptions involving multiple objects and reflect specified visual attributes to wrong targets or ignore them. This paper presents Global-Local Diffusion (\textit{GLoD}), a novel framework which allows simultaneous control over the global contexts and the local details in text-to-image generation without requiring training or fine-tuning. It assigns multiple global and local prompts to corresponding layers and composes their noises to guide a denoising process using pre-trained diffusion models. Our framework enables complex global-local compositions, conditioning objects in the global prompt with the local prompts while preserving other unspecified identities. Our quantitative and qualitative evaluations demonstrate that GLoD effectively generates complex images that adhere to both user-provided object interactions and object details.
- [1969] arXiv:2404.15449 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback LearningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The rapid development of diffusion models has triggered diverse applications. Identity-preserving text-to-image generation (ID-T2I) particularly has received significant attention due to its wide range of application scenarios like AI portrait and advertising. While existing ID-T2I methods have demonstrated impressive results, several key challenges remain: (1) It is hard to maintain the identity characteristics of reference portraits accurately, (2) The generated images lack aesthetic appeal especially while enforcing identity retention, and (3) There is a limitation that cannot be compatible with LoRA-based and Adapter-based methods simultaneously. To address these issues, we present \textbf{ID-Aligner}, a general feedback learning framework to enhance ID-T2I performance. To resolve identity features lost, we introduce identity consistency reward fine-tuning to utilize the feedback from face detection and recognition models to improve generated identity preservation. Furthermore, we propose identity aesthetic reward fine-tuning leveraging rewards from human-annotated preference data and automatically constructed feedback on character structure generation to provide aesthetic tuning signals. Thanks to its universal feedback fine-tuning framework, our method can be readily applied to both LoRA and Adapter models, achieving consistent performance gains. Extensive experiments on SD1.5 and SDXL diffusion models validate the effectiveness of our approach. \textbf{Project Page: \url{ this https URL }}
- [1970] arXiv:2404.15485 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Large Language Models Spot Phishing Emails with Surprising Accuracy: A Comparative Analysis of PerformanceComments: 7 pages, 3 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Phishing, a prevalent cybercrime tactic for decades, remains a significant threat in today's digital world. By leveraging clever social engineering elements and modern technology, cybercrime targets many individuals, businesses, and organizations to exploit trust and security. These cyber-attackers are often disguised in many trustworthy forms to appear as legitimate sources. By cleverly using psychological elements like urgency, fear, social proof, and other manipulative strategies, phishers can lure individuals into revealing sensitive and personalized information. Building on this pervasive issue within modern technology, this paper aims to analyze the effectiveness of 15 Large Language Models (LLMs) in detecting phishing attempts, specifically focusing on a randomized set of "419 Scam" emails. The objective is to determine which LLMs can accurately detect phishing emails by analyzing a text file containing email metadata based on predefined criteria. The experiment concluded that the following models, ChatGPT 3.5, GPT-3.5-Turbo-Instruct, and ChatGPT, were the most effective in detecting phishing emails.
- [1971] arXiv:2404.15488 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: IryoNLP at MEDIQA-CORR 2024: Tackling the Medical Error Detection & Correction Task On the Shoulders of Medical AgentsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: In natural language processing applied to the clinical domain, utilizing large language models has emerged as a promising avenue for error detection and correction on clinical notes, a knowledge-intensive task for which annotated data is scarce. This paper presents MedReAct'N'MedReFlex, which leverages a suite of four LLM-based medical agents. The MedReAct agent initiates the process by observing, analyzing, and taking action, generating trajectories to guide the search to target a potential error in the clinical notes. Subsequently, the MedEval agent employs five evaluators to assess the targeted error and the proposed correction. In cases where MedReAct's actions prove insufficient, the MedReFlex agent intervenes, engaging in reflective analysis and proposing alternative strategies. Finally, the MedFinalParser agent formats the final output, preserving the original style while ensuring the integrity of the error correction process. One core component of our method is our RAG pipeline based on our ClinicalCorp corpora. Among other well-known sources containing clinical guidelines and information, we preprocess and release the open-source MedWiki dataset for clinical RAG application. Our results demonstrate the central role of our RAG approach with ClinicalCorp leveraged through the MedReAct'N'MedReFlex framework. It achieved the ninth rank on the MEDIQA-CORR 2024 final leaderboard.
- [1972] arXiv:2404.15501 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic InformationComments: 11 pages, 9 tables, 3 figures, to be published in LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. Kichwa is an extremely low-resource endangered language, and there have been no resources before Killkan for Kichwa to be incorporated in applications of natural language processing. The dataset contains approximately 4 hours of audio with transcription, translation into Spanish, and morphosyntactic annotation in the format of Universal Dependencies. The audio data was retrieved from a publicly available radio program in Kichwa. This paper also provides corpus-linguistic analyses of the dataset with a special focus on the agglutinative morphology of Kichwa and frequent code-switching with Spanish. The experiments show that the dataset makes it possible to develop the first ASR system for Kichwa with reliable quality despite its small dataset size. This dataset, the ASR model, and the code used to develop them will be publicly available. Thus, our study positively showcases resource building and its applications for low-resource languages and their community.
- [1973] arXiv:2404.15503 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FedGreen: Carbon-aware Federated Learning with Model Size AdaptationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Federated learning (FL) provides a promising collaborative framework to build a model from distributed clients, and this work investigates the carbon emission of the FL process. Cloud and edge servers hosting FL clients may exhibit diverse carbon footprints influenced by their geographical locations with varying power sources, offering opportunities to reduce carbon emissions by training local models with adaptive computations and communications. In this paper, we propose FedGreen, a carbon-aware FL approach to efficiently train models by adopting adaptive model sizes shared with clients based on their carbon profiles and locations using ordered dropout as a model compression technique. We theoretically analyze the trade-offs between the produced carbon emissions and the convergence accuracy, considering the carbon intensity discrepancy across countries to choose the parameters optimally. Empirical studies show that FedGreen can substantially reduce the carbon footprints of FL compared to the state-of-the-art while maintaining competitive model accuracy.
- [1974] arXiv:2404.15515 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ToM-LM: Delegating Theory of Mind Reasoning to External Symbolic Executors in Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Theory of Mind (ToM) refers to the ability of individuals to attribute mental states to others. While Large Language Models (LLMs) have shown some promise with ToM ability, they still struggle with complex ToM reasoning. Our approach leverages an external symbolic executor, specifically the SMCDEL model checker, and fine-tuning to improve the ToM reasoning ability of LLMs. In our approach, an LLM is first fine-tuned through pairs of natural language and symbolic formulation representation of ToM problems and is then instructed to generate the symbolic formulation with a one-shot in-context example. The generated symbolic formulation is then executed by the SMCDEL model checker to perform transparent and verifiable ToM reasoning and give the final result. We demonstrate that our approach, ToM-LM, shows a significant improvement over all the constructed baselines. Our study proposes a novel view about externalizing a particular component of ToM reasoning, mainly reasoning about beliefs, and suggests generalizing it to other aspects of ToM reasoning.
- [1975] arXiv:2404.15516 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image RetrievalComments: 15 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image. These specific triplets are not as commonly available as simple image-text pairs, limiting the widespread use of CIR and its scalability. On the other hand, zero-shot CIR can be relatively easily trained with image-caption pairs without considering the image-to-image relation, but this approach tends to yield lower accuracy. We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data and learn our large language model-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e., visual delta) between the two. VDG, equipped with fluent language knowledge and being model agnostic, can generate pseudo triplets to boost the performance of CIR models. Our approach significantly improves the existing supervised learning approaches and achieves state-of-the-art results on the CIR benchmarks.
- [1976] arXiv:2404.15518 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In traditional statistical learning, data points are usually assumed to be independently and identically distributed (i.i.d.) following an unknown probability distribution. This paper presents a contrasting viewpoint, perceiving data points as interconnected and employing a Markov reward process (MRP) for data modeling. We reformulate the typical supervised learning as an on-policy policy evaluation problem within reinforcement learning (RL), introducing a generalized temporal difference (TD) learning algorithm as a resolution. Theoretically, our analysis draws connections between the solutions of linear TD learning and ordinary least squares (OLS). We also show that under specific conditions, particularly when noises are correlated, the TD's solution proves to be a more effective estimator than OLS. Furthermore, we establish the convergence of our generalized TD algorithms under linear function approximation. Empirical studies verify our theoretical results, examine the vital design of our TD algorithm and show practical utility across various datasets, encompassing tasks such as regression and image classification with deep learning.
- [1977] arXiv:2404.15522 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language ModelsMihir Parmar , Nisarg Patel , Neeraj Varshney , Mutsumi Nakamura , Man Luo , Santosh Mashetty , Arindam Mitra , Chitta BaralComments: 29 PagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to 'logical reasoning' has remained underexplored. Existing work investigating this reasoning ability of LLMs has focused only on a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. Addressing the above limitation, we comprehensively evaluate the logical reasoning ability of LLMs on 25 different reasoning patterns spanning over propositional, first-order, and non-monotonic logics. To enable systematic evaluation, we introduce LogicBench, a natural language question-answering dataset focusing on the use of a single inference rule. We conduct detailed analysis with a range of LLMs such as GPT-4, ChatGPT, Gemini, Llama-2, and Mistral using chain-of-thought prompting. Experimental results show that existing LLMs do not fare well on LogicBench; especially, they struggle with instances involving complex reasoning and negations. Furthermore, they sometimes overlook contextual information necessary for reasoning to arrive at the correct conclusion. We believe that our work and findings facilitate future research for evaluating and enhancing the logical reasoning ability of LLMs. Data and code are available at this https URL .
- [1978] arXiv:2404.15532 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: BattleAgent: Multi-modal Dynamic Emulation on Historical Battles to Complement Historical AnalysisShuhang Lin , Wenyue Hua , Lingyao Li , Che-Jui Chang , Lizhou Fan , Jianchao Ji , Hang Hua , Mingyu Jin , Jiebo Luo , Yongfeng ZhangComments: 26 pages, 14 figures The data and code for this project are accessible at this https URLSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Abstract: This paper presents BattleAgent, an emulation system that combines the Large Vision-Language Model and Multi-agent System. This novel system aims to simulate complex dynamic interactions among multiple agents, as well as between agents and their environments, over a period of time. It emulates both the decision-making processes of leaders and the viewpoints of ordinary participants, such as soldiers. The emulation showcases the current capabilities of agents, featuring fine-grained multi-modal interactions between agents and landscapes. It develops customizable agent structures to meet specific situational requirements, for example, a variety of battle-related activities like scouting and trench digging. These components collaborate to recreate historical events in a lively and comprehensive manner while offering insights into the thoughts and feelings of individuals from diverse viewpoints. The technological foundations of BattleAgent establish detailed and immersive settings for historical battles, enabling individual agents to partake in, observe, and dynamically respond to evolving battle scenarios. This methodology holds the potential to substantially deepen our understanding of historical events, particularly through individual accounts. Such initiatives can also aid historical research, as conventional historical narratives often lack documentation and prioritize the perspectives of decision-makers, thereby overlooking the experiences of ordinary individuals. BattelAgent illustrates AI's potential to revitalize the human aspect in crucial social events, thereby fostering a more nuanced collective understanding and driving the progressive development of human society.
- [1979] arXiv:2404.15538 (cross-list from cs.GR) [ pdf , ps , html , other ]
-
Title: DreamCraft: Text-Guided Generation of Functional 3D Environments in MinecraftComments: 16 pages, 9 figures, accepted to Foundation of Digital Games 2024Subjects: Graphics (cs.GR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Procedural Content Generation (PCG) algorithms enable the automatic generation of complex and diverse artifacts. However, they don't provide high-level control over the generated content and typically require domain expertise. In contrast, text-to-3D methods allow users to specify desired characteristics in natural language, offering a high amount of flexibility and expressivity. But unlike PCG, such approaches cannot guarantee functionality, which is crucial for certain applications like game design. In this paper, we present a method for generating functional 3D artifacts from free-form text prompts in the open-world game Minecraft. Our method, DreamCraft, trains quantized Neural Radiance Fields (NeRFs) to represent artifacts that, when viewed in-game, match given text descriptions. We find that DreamCraft produces more aligned in-game artifacts than a baseline that post-processes the output of an unconstrained NeRF. Thanks to the quantized representation of the environment, functional constraints can be integrated using specialized loss terms. We show how this can be leveraged to generate 3D structures that match a target distribution or obey certain adjacency rules over the block types. DreamCraft inherits a high degree of expressivity and controllability from the NeRF, while still being able to incorporate functional constraints through domain-specific objectives.
- [1980] arXiv:2404.15549 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: PRISM: Patient Records Interpretation for Semantic Clinical Trial Matching using Large Language ModelsShashi Kant Gupta , Aditya Basu , Mauro Nievas , Jerrin Thomas , Nathan Wolfrath , Adhitya Ramamurthi , Bradley Taylor , Anai N. Kothari , Regina Schwind , Therica M. Miller , Sorena Nadaf-Rahrov , Yanshan Wang , Hrituraj SinghComments: 30 Pages, 8 Figures, Supplementary Work AttachedSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Clinical trial matching is the task of identifying trials for which patients may be potentially eligible. Typically, this task is labor-intensive and requires detailed verification of patient electronic health records (EHRs) against the stringent inclusion and exclusion criteria of clinical trials. This process is manual, time-intensive, and challenging to scale up, resulting in many patients missing out on potential therapeutic options. Recent advancements in Large Language Models (LLMs) have made automating patient-trial matching possible, as shown in multiple concurrent research studies. However, the current approaches are confined to constrained, often synthetic datasets that do not adequately mirror the complexities encountered in real-world medical data. In this study, we present the first, end-to-end large-scale empirical evaluation of clinical trial matching using real-world EHRs. Our study showcases the capability of LLMs to accurately match patients with appropriate clinical trials. We perform experiments with proprietary LLMs, including GPT-4 and GPT-3.5, as well as our custom fine-tuned model called OncoLLM and show that OncoLLM, despite its significantly smaller size, not only outperforms GPT-3.5 but also matches the performance of qualified medical doctors. All experiments were carried out on real-world EHRs that include clinical notes and available clinical trials from a single cancer center in the United States.
- [1981] arXiv:2404.15564 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Guided AbsoluteGrad: Magnitude of Gradients Matters to Explanation's Localization and SaliencyComments: CAI2024 Camera-ready SubmissionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: This paper proposes a new gradient-based XAI method called Guided AbsoluteGrad for saliency map explanations. We utilize both positive and negative gradient magnitudes and employ gradient variance to distinguish the important areas for noise deduction. We also introduce a novel evaluation metric named ReCover And Predict (RCAP), which considers the Localization and Visual Noise Level objectives of the explanations. We propose two propositions for these two objectives and prove the necessity of evaluating them. We evaluate Guided AbsoluteGrad with seven gradient-based XAI methods using the RCAP metric and other SOTA metrics in three case studies: (1) ImageNet dataset with ResNet50 model; (2) International Skin Imaging Collaboration (ISIC) dataset with EfficientNet model; (3) the Places365 dataset with DenseNet161 model. Our method surpasses other gradient-based approaches, showcasing the quality of enhanced saliency map explanations through gradient magnitude.
- [1982] arXiv:2404.15592 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value ExtractionHenry Peng Zou , Vinay Samuel , Yue Zhou , Weizhi Zhang , Liancheng Fang , Zihe Song , Philip S. Yu , Cornelia CarageaSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: Existing datasets for attribute value extraction (AVE) predominantly focus on explicit attribute values while neglecting the implicit ones, lack product images, are often not publicly available, and lack an in-depth human inspection across diverse domains. To address these limitations, we present ImplicitAVE, the first, publicly available multimodal dataset for implicit attribute value extraction. ImplicitAVE, sourced from the MAVE dataset, is carefully curated and expanded to include implicit AVE and multimodality, resulting in a refined dataset of 68k training and 1.6k testing data across five domains. We also explore the application of multimodal large language models (MLLMs) to implicit AVE, establishing a comprehensive benchmark for MLLMs on the ImplicitAVE dataset. Six recent MLLMs with eleven variants are evaluated across diverse settings, revealing that implicit value extraction remains a challenging task for MLLMs. The contributions of this work include the development and release of ImplicitAVE, and the exploration and benchmarking of various MLLMs for implicit AVE, providing valuable insights and potential future research directions. Dataset and code are available at this https URL
- [1983] arXiv:2404.15597 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: GRSN: Gated Recurrent Spiking Neurons for POMDPs and MARLSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: Spiking neural networks (SNNs) are widely applied in various fields due to their energy-efficient and fast-inference capabilities. Applying SNNs to reinforcement learning (RL) can significantly reduce the computational resource requirements for agents and improve the algorithm's performance under resource-constrained conditions. However, in current spiking reinforcement learning (SRL) algorithms, the simulation results of multiple time steps can only correspond to a single-step decision in RL. This is quite different from the real temporal dynamics in the brain and also fails to fully exploit the capacity of SNNs to process temporal data. In order to address this temporal mismatch issue and further take advantage of the inherent temporal dynamics of spiking neurons, we propose a novel temporal alignment paradigm (TAP) that leverages the single-step update of spiking neurons to accumulate historical state information in RL and introduces gated units to enhance the memory capacity of spiking neurons. Experimental results show that our method can solve partially observable Markov decision processes (POMDPs) and multi-agent cooperation problems with similar performance as recurrent neural networks (RNNs) but with about 50% power consumption.
- [1984] arXiv:2404.15599 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: Human-in-the-loop Learning for Dynamic Congestion GamesComments: This paper has been accepted by IEEE Transactions on Mobile Computing (2024)Subjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI)
Abstract: Today mobile users learn and share their traffic observations via crowdsourcing platforms (e.g., Waze). Yet such platforms simply cater to selfish users' myopic interests to recommend the shortest path, and do not encourage enough users to travel and learn other paths for future others. Prior studies focus on one-shot congestion games without considering users' information learning, while our work studies how users learn and alter traffic conditions on stochastic paths in a human-in-the-loop manner. Our analysis shows that the myopic routing policy leads to severe under-exploration of stochastic paths. This results in a price of anarchy (PoA) greater than $2$, as compared to the socially optimal policy in minimizing the long-term social cost. Besides, the myopic policy fails to ensure the correct learning convergence about users' traffic hazard beliefs. To address this, we focus on informational (non-monetary) mechanisms as they are easier to implement than pricing. We first show that existing information-hiding mechanisms and deterministic path-recommendation mechanisms in Bayesian persuasion literature do not work with even (\text{PoA}=\infty). Accordingly, we propose a new combined hiding and probabilistic recommendation (CHAR) mechanism to hide all information from a selected user group and provide state-dependent probabilistic recommendations to the other user group. Our CHAR successfully ensures PoA less than (\frac{5}{4}), which cannot be further reduced by any other informational (non-monetary) mechanism. Besides the parallel network, we further extend our analysis and CHAR to more general linear path graphs with multiple intermediate nodes, and we prove that the PoA results remain unchanged. Additionally, we carry out experiments with real-world datasets to further extend our routing graphs and verify the close-to-optimal performance of our CHAR.
- [1985] arXiv:2404.15601 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Deepfakes and Higher Education: A Research Agenda and Scoping Review of Synthetic MediaJasper Roe (1), Mike Perkins (2) ((1) James Cook University Singapore, (2) British University Vietnam)Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: The availability of software which can produce convincing yet synthetic media poses both threats and benefits to tertiary education globally. While other forms of synthetic media exist, this study focuses on deepfakes, which are advanced Generative AI (GenAI) fakes of real people. This conceptual paper assesses the current literature on deepfakes across multiple disciplines by conducting an initial scoping review of 182 peer-reviewed publications.
The review reveals three major trends: detection methods, malicious applications, and potential benefits, although no specific studies on deepfakes in the tertiary educational context were found. Following a discussion of these trends, this study applies the findings to postulate the major risks and potential mitigation strategies of deepfake technologies in higher education, as well as potential beneficial uses to aid the teaching and learning of both deepfakes and synthetic media. This culminates in the proposal of a research agenda to build a comprehensive, cross-cultural approach to investigate deepfakes in higher education. - [1986] arXiv:2404.15604 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Hybrid LLM/Rule-based Approaches to Business Insights Generation from Structured DataSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In the field of business data analysis, the ability to extract actionable insights from vast and varied datasets is essential for informed decision-making and maintaining a competitive edge. Traditional rule-based systems, while reliable, often fall short when faced with the complexity and dynamism of modern business data. Conversely, Artificial Intelligence (AI) models, particularly Large Language Models (LLMs), offer significant potential in pattern recognition and predictive analytics but can lack the precision necessary for specific business applications. This paper explores the efficacy of hybrid approaches that integrate the robustness of rule-based systems with the adaptive power of LLMs in generating actionable business insights.
- [1987] arXiv:2404.15616 (cross-list from quant-ph) [ pdf , ps , html , other ]
-
Title: A Bi-directional Quantum Search AlgorithmComments: 7 pagesSubjects: Quantum Physics (quant-ph) ; Artificial Intelligence (cs.AI)
Abstract: Grover's search algorithms, including various partial Grover searches, experience scaling problems as the number of iterations rises with increased qubits, making implementation more computationally expensive. This paper combines Partial Grover's search algorithm and Bi-directional Search to create a fast Grover's quantum search algorithm, referred to as Bi-Directional Grover Search (BDGS). We incorporated a bi-directional search tactic with a partial Grover search, starting from an initial state and a single marked state in parallel. We have shown in this article that our novel approach requires $\frac{\pi}{4\sqrt{2}}\sqrt{N}(1-\sqrt{\frac{1}{b^{r/2k}}})$ iterations over regular Grover Search and Partial Grover Search (PGS), which takes $\frac{\pi}{4}\sqrt{N}\sqrt{1-\frac{1}{b}}$ (here, $N=2^r$ elements, $b$ is the branching factor of partial search, and $k= \lceil\log_2b \rceil$). The proposed BDGS algorithm is benchmarked against the state-of-the-art Depth-First Grover's Search (DFGS) and generic Grover's Search (GS) implementations for $2$ to $20$ qubits and provides promising results. The Qiskit Python implementation of the proposed BDGS algorithm is available on Github ( this https URL ).
- [1988] arXiv:2404.15617 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: DPO: Differential reinforcement learning with application to optimal configuration searchComments: 24 pages, 1 figure, 2 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Statistics Theory (math.ST)
Abstract: Reinforcement learning (RL) with continuous state and action spaces remains one of the most challenging problems within the field. Most current learning methods focus on integral identities such as value functions to derive an optimal strategy for the learning agent. In this paper, we instead study the dual form of the original RL formulation to propose the first differential RL framework that can handle settings with limited training samples and short-length episodes. Our approach introduces Differential Policy Optimization (DPO), a pointwise and stage-wise iteration method that optimizes policies encoded by local-movement operators. We prove a pointwise convergence estimate for DPO and provide a regret bound comparable with current theoretical works. Such pointwise estimate ensures that the learned policy matches the optimal path uniformly across different steps. We then apply DPO to a class of practical RL problems which search for optimal configurations with Lagrangian rewards. DPO is easy to implement, scalable, and shows competitive results on benchmarking experiments against several popular RL methods.
- [1989] arXiv:2404.15633 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: Artificial Intelligence for Multi-Unit Auction designSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
Abstract: Understanding bidding behavior in multi-unit auctions remains an ongoing challenge for researchers. Despite their widespread use, theoretical insights into the bidding behavior, revenue ranking, and efficiency of commonly used multi-unit auctions are limited. This paper utilizes artificial intelligence, specifically reinforcement learning, as a model free learning approach to simulate bidding in three prominent multi-unit auctions employed in practice. We introduce six algorithms that are suitable for learning and bidding in multi-unit auctions and compare them using an illustrative example. This paper underscores the significance of using artificial intelligence in auction design, particularly in enhancing the design of multi-unit auctions.
- [1990] arXiv:2404.15638 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: PriorNet: A Novel Lightweight Network with Multidimensional Interactive Attention for Efficient Image DehazingComments: 8 pages, 4 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Hazy images degrade visual quality, and dehazing is a crucial prerequisite for subsequent processing tasks. Most current dehazing methods rely on neural networks and face challenges such as high computational parameter pressure and weak generalization capabilities. This paper introduces PriorNet--a novel, lightweight, and highly applicable dehazing network designed to significantly improve the clarity and visual quality of hazy images while avoiding excessive detail extraction issues. The core of PriorNet is the original Multi-Dimensional Interactive Attention (MIA) mechanism, which effectively captures a wide range of haze characteristics, substantially reducing the computational load and generalization difficulties associated with complex systems. By utilizing a uniform convolutional kernel size and incorporating skip connections, we have streamlined the feature extraction process. Simplifying the number of layers and architecture not only enhances dehazing efficiency but also facilitates easier deployment on edge devices. Extensive testing across multiple datasets has demonstrated PriorNet's exceptional performance in dehazing and clarity restoration, maintaining image detail and color fidelity in single-image dehazing tasks. Notably, with a model size of just 18Kb, PriorNet showcases superior dehazing generalization capabilities compared to other methods. Our research makes a significant contribution to advancing image dehazing technology, providing new perspectives and tools for the field and related domains, particularly emphasizing the importance of improving universality and deployability.
- [1991] arXiv:2404.15648 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Affordance Blending NetworksComments: Early preprintSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Affordances, a concept rooted in ecological psychology and pioneered by James J. Gibson, have emerged as a fundamental framework for understanding the dynamic relationship between individuals and their environments. Expanding beyond traditional perceptual and cognitive paradigms, affordances represent the inherent effect and action possibilities that objects offer to the agents within a given context. As a theoretical lens, affordances bridge the gap between effect and action, providing a nuanced understanding of the connections between agents' actions on entities and the effect of these actions. In this study, we propose a model that unifies object, action and effect into a single latent representation in a common latent space that is shared between all affordances that we call the affordance space. Using this affordance space, our system is able to generate effect trajectories when action and object are given and is able to generate action trajectories when effect trajectories and objects are given. In the experiments, we showed that our model does not learn the behavior of each object but it learns the affordance relations shared by the objects that we call equivalences. In addition to simulated experiments, we showed that our model can be used for direct imitation in real world cases. We also propose affordances as a base for Cross Embodiment transfer to link the actions of different robots. Finally, we introduce selective loss as a solution that allows valid outputs to be generated for indeterministic model inputs.
- [1992] arXiv:2404.15653 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text DataSachin Mehta , Maxwell Horton , Fartash Faghri , Mohammad Hossein Sekhavat , Mahyar Najibi , Mehrdad Farajtabar , Oncel Tuzel , Mohammad RastegariSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training of vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in contrastive loss, achieving a remarkable $2.7\times$ acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality. Our source code along with pre-trained model weights and training recipes is available at \url{ this https URL }.
- [1993] arXiv:2404.15657 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FedSI: Federated Subnetwork Inference for Efficient Uncertainty QuantificationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: While deep neural networks (DNNs) based personalized federated learning (PFL) is demanding for addressing data heterogeneity and shows promising performance, existing methods for federated learning (FL) suffer from efficient systematic uncertainty quantification. The Bayesian DNNs-based PFL is usually questioned of either over-simplified model structures or high computational and memory costs. In this paper, we introduce FedSI, a novel Bayesian DNNs-based subnetwork inference PFL framework. FedSI is simple and scalable by leveraging Bayesian methods to incorporate systematic uncertainties effectively. It implements a client-specific subnetwork inference mechanism, selects network parameters with large variance to be inferred through posterior distributions, and fixes the rest as deterministic ones. FedSI achieves fast and scalable inference while preserving the systematic uncertainties to the fullest extent. Extensive experiments on three different benchmark datasets demonstrate that FedSI outperforms existing Bayesian and non-Bayesian FL baselines in heterogeneous FL scenarios.
- [1994] arXiv:2404.15667 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic ReviewsComments: Accepted to the International Conference on Evaluation and Assessment in Software Engineering (EASE), 2024 editionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Systematic review (SR) is a popular research method in software engineering (SE). However, conducting an SR takes an average of 67 weeks. Thus, automating any step of the SR process could reduce the effort associated with SRs. Our objective is to investigate if Large Language Models (LLMs) can accelerate title-abstract screening by simplifying abstracts for human screeners, and automating title-abstract screening. We performed an experiment where humans screened titles and abstracts for 20 papers with both original and simplified abstracts from a prior SR. The experiment with human screeners was reproduced with GPT-3.5 and GPT-4 LLMs to perform the same screening tasks. We also studied if different prompting techniques (Zero-shot (ZS), One-shot (OS), Few-shot (FS), and Few-shot with Chain-of-Thought (FS-CoT)) improve the screening performance of LLMs. Lastly, we studied if redesigning the prompt used in the LLM reproduction of screening leads to improved performance. Text simplification did not increase the screeners' screening performance, but reduced the time used in screening. Screeners' scientific literacy skills and researcher status predict screening performance. Some LLM and prompt combinations perform as well as human screeners in the screening tasks. Our results indicate that the GPT-4 LLM is better than its predecessor, GPT-3.5. Additionally, Few-shot and One-shot prompting outperforms Zero-shot prompting. Using LLMs for text simplification in the screening process does not significantly improve human performance. Using LLMs to automate title-abstract screening seems promising, but current LLMs are not significantly more accurate than human screeners. To recommend the use of LLMs in the screening process of SRs, more research is needed. We recommend future SR studies publish replication packages with screening data to enable more conclusive experimenting with LLM screening.
- [1995] arXiv:2404.15676 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Chain-of-Thought (CoT) has been a widely adopted prompting method, eliciting impressive reasoning abilities of Large Language Models (LLMs). Inspired by the sequential thought structure of CoT, a number of Chain-of-X (CoX) methods have been developed to address various challenges across diverse domains and tasks involving LLMs. In this paper, we provide a comprehensive survey of Chain-of-X methods for LLMs in different contexts. Specifically, we categorize them by taxonomies of nodes, i.e., the X in CoX, and application tasks. We also discuss the findings and implications of existing CoX methods, as well as potential future directions. Our survey aims to serve as a detailed and up-to-date resource for researchers seeking to apply the idea of CoT to broader scenarios.
- [1996] arXiv:2404.15678 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Retrieval and Distill: A Temporal Data Shift-Free Paradigm for Online Recommendation SystemSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Current recommendation systems are significantly affected by a serious issue of temporal data shift, which is the inconsistency between the distribution of historical data and that of online data. Most existing models focus on utilizing updated data, overlooking the transferable, temporal data shift-free information that can be learned from shifting data. We propose the Temporal Invariance of Association theorem, which suggests that given a fixed search space, the relationship between the data and the data in the search space keeps invariant over time. Leveraging this principle, we designed a retrieval-based recommendation system framework that can train a data shift-free relevance network using shifting data, significantly enhancing the predictive performance of the original model in the recommendation system. However, retrieval-based recommendation models face substantial inference time costs when deployed online. To address this, we further designed a distill framework that can distill information from the relevance network into a parameterized module using shifting data. The distilled model can be deployed online alongside the original model, with only a minimal increase in inference time. Extensive experiments on multiple real datasets demonstrate that our framework significantly improves the performance of the original model by utilizing shifting data.
- [1997] arXiv:2404.15681 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Automated Creation of Source Code Variants of a Cryptographic Hash Function Implementation Using Generative Pre-Trained Transformer ModelsSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Generative pre-trained transformers (GPT's) are a type of large language machine learning model that are unusually adept at producing novel, and coherent, natural language. In this study the ability of GPT models to generate novel and correct versions, and notably very insecure versions, of implementations of the cryptographic hash function SHA-1 is examined. The GPT models Llama-2-70b-chat-h, Mistral-7B-Instruct-v0.1, and zephyr-7b-alpha are used. The GPT models are prompted to re-write each function using a modified version of the localGPT framework and langchain to provide word embedding context of the full source code and header files to the model, resulting in over 130,000 function re-write GPT output text blocks, approximately 40,000 of which were able to be parsed as C code and subsequently compiled. The generated code is analyzed for being compilable, correctness of the algorithm, memory leaks, compiler optimization stability, and character distance to the reference implementation. Remarkably, several generated function variants have a high implementation security risk of being correct for some test vectors, but incorrect for other test vectors. Additionally, many function implementations were not correct to the reference algorithm of SHA-1, but produced hashes that have some of the basic characteristics of hash functions. Many of the function re-writes contained serious flaws such as memory leaks, integer overflows, out of bounds accesses, use of uninitialised values, and compiler optimization instability. Compiler optimization settings and SHA-256 hash checksums of the compiled binaries are used to cluster implementations that are equivalent but may not have identical syntax - using this clustering over 100,000 novel and correct versions of the SHA-1 codebase were generated where each component C function of the reference implementation is different from the original code.
- [1998] arXiv:2404.15687 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Graph Neural Networks for Vulnerability Detection: A Counterfactual ExplanationComments: This paper was accepted in the proceedings of the 33nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024)Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: Vulnerability detection is crucial for ensuring the security and reliability of software systems. Recently, Graph Neural Networks (GNNs) have emerged as a prominent code embedding approach for vulnerability detection, owing to their ability to capture the underlying semantic structure of source code. However, GNNs face significant challenges in explainability due to their inherently black-box nature. To this end, several factual reasoning-based explainers have been proposed. These explainers provide explanations for the predictions made by GNNs by analyzing the key features that contribute to the outcomes. We argue that these factual reasoning-based explanations cannot answer critical what-if questions: What would happen to the GNN's decision if we were to alter the code graph into alternative structures? Inspired by advancements of counterfactual reasoning in artificial intelligence, we propose CFExplainer, a novel counterfactual explainer for GNN-based vulnerability detection. Unlike factual reasoning-based explainers, CFExplainer seeks the minimal perturbation to the input code graph that leads to a change in the prediction, thereby addressing the what-if questions for vulnerability detection. We term this perturbation a counterfactual explanation, which can pinpoint the root causes of the detected vulnerability and furnish valuable insights for developers to undertake appropriate actions for fixing the vulnerability. Extensive experiments on four GNN-based vulnerability detection models demonstrate the effectiveness of CFExplainer over existing state-of-the-art factual reasoning-based explainers.
- [1999] arXiv:2404.15697 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DeepFeatureX Net: Deep Features eXtractors based Network for discriminating synthetic from real imagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Deepfakes, synthetic images generated by deep learning algorithms, represent one of the biggest challenges in the field of Digital Forensics. The scientific community is working to develop approaches that can discriminate the origin of digital images (real or AI-generated). However, these methodologies face the challenge of generalization, that is, the ability to discern the nature of an image even if it is generated by an architecture not seen during training. This usually leads to a drop in performance. In this context, we propose a novel approach based on three blocks called Base Models, each of which is responsible for extracting the discriminative features of a specific image class (Diffusion Model-generated, GAN-generated, or real) as it is trained by exploiting deliberately unbalanced datasets. The features extracted from each block are then concatenated and processed to discriminate the origin of the input image. Experimental results showed that this approach not only demonstrates good robust capabilities to JPEG compression but also outperforms state-of-the-art methods in several generalization tests. Code, models and dataset are available at this https URL .
- [2000] arXiv:2404.15704 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Efficient Multi-Model Fusion with Adversarial Complementary Representation LearningComments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: Single-model systems often suffer from deficiencies in tasks such as speaker verification (SV) and image classification, relying heavily on partial prior knowledge during decision-making, resulting in suboptimal performance. Although multi-model fusion (MMF) can mitigate some of these issues, redundancy in learned representations may limits improvements. To this end, we propose an adversarial complementary representation learning (ACoRL) framework that enables newly trained models to avoid previously acquired knowledge, allowing each individual component model to learn maximally distinct, complementary representations. We make three detailed explanations of why this works and experimental results demonstrate that our method more efficiently improves performance compared to traditional MMF. Furthermore, attribution analysis validates the model trained under ACoRL acquires more complementary knowledge, highlighting the efficacy of our approach in enhancing efficiency and robustness across tasks.
- [2001] arXiv:2404.15714 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Ada-DF: An Adaptive Label Distribution Fusion Network For Facial Expression RecognitionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Facial expression recognition (FER) plays a significant role in our daily life. However, annotation ambiguity in the datasets could greatly hinder the performance. In this paper, we address FER task via label distribution learning paradigm, and develop a dual-branch Adaptive Distribution Fusion (Ada-DF) framework. One auxiliary branch is constructed to obtain the label distributions of samples. The class distributions of emotions are then computed through the label distributions of each emotion. Finally, those two distributions are adaptively fused according to the attention weights to train the target branch. Extensive experiments are conducted on three real-world datasets, RAF-DB, AffectNet and SFEW, where our Ada-DF shows advantages over the state-of-the-art works.
Total of 2408 entries :
2-2001
2001-2408